Day #2 – Data Modeling

Day #2 of my course involved getting over view of data modelling. The course started off with basic introduction courses for relational and cloud databases. The course per se was touching only on basic terms and bit underwhelming w.r.t to intro to PostgreSQL database.

Cassandra database is the next hurdle to cross and work on.

Day #1 – Getting the engine started

Yesterday marked my first day in the Data Engineer Nanodegree course offered by Udacity. After thinking a lot on how to best equip myself and enrich my knowledge in the world of Big Data and taking the steps towards it, this course came along talking about THE essential things that I wanted to learn – Cloud data warehouses, Spark and Data Lakes.

What further sealed the deal was I am working on a project where we are using Spark and Data Lake as well. However, it is being handled by a separate team. My involvement so far has been to the extent of writing Impala queries, creating data structure, testing the sqoop queries and occasionally query tuning by looking at the logs to understand which partitioning is better. I reasoned that doing this course will give me a better ammo to pitch myself to get into the Data Lake team. Time will tell (fingers crossed)

I have been longing for an opportunity to pivot my career from the traditional BI to Data Engineering on Big Data Platforms. Here is a course that not only promises to teach the nitty gritties of being a Data Engineer with a proper structured methodical teaching but also help with shaping up my career via services like resume editing and LinkedIn page setup. Long way to go for that.

So here is what my Day 1 (yesterday) felt like so far- Absolutely wonderful!
In the first few videos I have really gotten to know what Data Engineer really means and what other titles actually mean and how they stack up.

What resonated me a lot was this article that was one of the materials to read up. It spoke volumes to me as this was exactly the path I had been following all through my career. I started off writing ETL packages via SSIS on traditional OLTP – OLAP databases, designing cubes off of it, designing and developing reports based on it.

All these have stopped about 3 years ago and it was only a year ago, I am completely off it. I am now working on data sources which are disparate in nature or are built on the Data Lake. This is a brand new world for me and am loving every part of it. The challenges are different, more exciting and there is SO much more to be done.

Looking at the evolution of how data has proliferated and how the traditional RDBMS technologies are not sufficient to cater the growing needs of business, I am happy to see the organic growth in me. Of course, to be where I am today, the forces that have shaped me are largely due to the work done in BI but stepping into new future I need more ammo.

Coming back to the course, I started off with Data Modelling basics and some intro into PostgreSQL.

Next post would be more structured. The purpose of this post with # tag is to motivate myself to read every day and share my thoughts on my learning.

Alteryx App- Adding (Select All) to Dropdown

One of the most common requirement that comes when creating Alteryx Apps involving a dropdown is to have (Select All) as one of the values. This value, if you have not inferred by now, would not be part of a data source but something we add to it. Basically I am trying to simulate what Excel does when ‘Sort & Filter’ is enabled as seen below – 

2019-02-07 11_39_50-Marvel Movies Box Office Report.xlsx - Excel

In this post I am going to demonstrate how to add this value in the drop down and how it then needs to be consumed in filtering the datasource. This post will have the following sections –

  1. Get Dummy Data
  2. Add (Select All) to Dropdown tool
  3. Filter dataset using the ‘(Select All)’

1. Get Dummy Data

The test data which I am going to use for this post is the highest grossing Marvel Movies which I am getting from this link – Marvel Comics Movies at Box Office. I copied the first table and stored it in my local disk. In my Alteryx App, I have dragged this data as a Input Tool. Used DateTime Tool and Filter to create a unique Date Format.

Sample Data is shown below –

2. Create Filter with (Select All) in it

As next step I am going to create one Dropdown filter – ‘Studio’. For this post I am choosing the option – Manually set values (Name:Value – one per line). I have entered the values manually. As you can see I have kept the option ‘(Select All)’ as the first entry. 

2019-02-07 11_55_37-Alteryx Designer x64 - Test Workflow - Blog.yxwz_.png

If the data is coming via connected tool, ensure that source data for the filter is joined with the ‘(Select All)’ manual text data. Likewise, if it’s coming from external source, add this entry. Essentially because this value doesn’t exist, we would need to add it.

3. Filter dataset using the ‘(Select All)’

We identified our column on which we need to filter. We have created the values for the filter. It is now time to actually ‘filter’. I have now dragged in a Filter transform. Too much repetitive, isn’t it. I will now ‘filter’ (pardon the pun) going forward. Put the following entry for the ‘Customer Filter’ option as shown below –

2019-02-07 12_05_27-Alteryx Designer x64 - Test Workflow - Blog.yxwz_.png

Here is what we need to do next –
1. Connect the ‘search’ icon of ‘Studio’ dropdown to the lightning bolt of ‘Filter on Studio’ Operator. A ‘Update value’ operator pops up in between.
2. Click on the operator. Under the ‘Value or Attribute to Update:’ section, click on “Expression – value = ….”
3. Click on the checkbox ‘Replace a specific string:’ and just keep the value as ‘<studio>’ (without quotes) as can be seen below- 

2019-02-07 12_14_02-Alteryx Designer x64 - Test Workflow - Blog.yxwz_.png

That’s it. Essentially the logic here is the inclusion of OR expression of place holder value equal to ‘(Select All)’. When a specific filter is selected, the select all expression goes false and when ‘(Select All)’ option is selected, well this one gets true and essentially the filter will flow through all the data.

Same logic can be used for filtering the data say in a Dynamic Input with data coming from a particular SQL data source. Code for that would look like this – 

select *
from  really_awesome_table2019-02-07 14_18_14-Alteryx Designer x64 - Select All - Filter implementation.yxwz.png
where (
           that_parameter  = ‘<this_value>’
            or
           ‘(Select All)’ = ‘<this_value>’
          )

To test it out I have put in a summarize transform to group by ‘Studio’ and ‘Total Movies’. Let’s run the workflow.  Here is how the wizard looks like –

2019-02-07 14_20_04-Alteryx Designer x64 - Select All - Filter implementation.yxwz.png
Here is how the output looks like, executing with (Select All) and individual Filter –



IntelliJ – New Project – Scala not appearing

Have you come across the problem of not being able to select ‘Scala’ as part of the ‘New Project’ in IntelliJ as shown below (psst – there is Scala present, I know. For imagination sake let’s say you don’t see it, kapise?)  ?

IntelliJ-New Project-Window.png

First thing when you google, everything it points out is for you to installing ‘Scala plugin’.

Let’s say you do have the plugin installed and available and enabled as shown below but even then you don’t find it in the ‘New Project’ –
IntelliJ-Plugins-Scala-Enabled.png
This usually happens say when you have upgraded your IntelliJ or re-installed your IntelliJ with a different version at different time and have imported the old settings. The best thing to do to get over this error is navigate to the folder – C:\users\<username>\ and delete all the folders starting with .IdeaICxxxx as highlighted below –
IntelliJ-User Folder-Delete.png

That’s it. Now open up your IntelliJ again, all should be good now to get going. Of course this time around you would need to get your JDK and Scala libraries set up. Follow this blog – link

 

Alteryx – PostgreSQL – Out of memory while reading tuples

While working on one of the workflows of Alteryx with PostgreSQL data sources, I started getting this wierd error –

Error SQLExecute: Out of memory while reading tuples.; memory allocation error???

This was a unusual error. For one I was running this not on my local machine but on the Gallery which was governed by the server. Also I felt Alteryx folks could have done away with three question marks (???). It feels as if we are being questioned on the memory allocation and it’s just waiting for some confirmation.

This is one of those classic it-works-in-dev-not-in-prod kind of problem. I then went back and compared the two ODBC data source settings across both the environments. It was here I could see the problem. You need to set the setting for PostgreSQL ODBC data source to have ‘Use Declare/Fetch’ option set in.

Open up the ‘ODBC Data Source Administrator’ window after running it as administrator. Click on the ‘Data Source’ that you are using and click on ‘Configure’. In the ‘PostgreSQL ANSI ODBC Driver (psqlODBC) Setup’ window, click on ‘Datasource’. Ensure that the checkbox ‘Use Declare/Fetch’ is checked and then click on ‘OK’ and close the subsequent windows. This setting resolves the issue. Below is the screenshot

PostgreSQL data source setting

 

Running dynamic Powershell script using Alteryx

Figuring out how to run Powershell scripts using Alteryx has been a very harrowing experience for me. I had used ‘Run command’ tasks before but I remember even then it was quite a struggle to get it up and running.

The help files from Alteryx with regards to this task is abysmal. It gives an explanation of each task within the command but doesn’t really tell you how to go about doing it step by step. The nearest help from Alteryx forum that I could get is this. Here is my attempt to make it easier for folks who find themselves in similar situation.

Task Description
I need to create a sub-folder under multiple folders. This ‘sub-folder’ name is dynamically determined during run-time depending on the input file name to the workflow.

For example, say my input file is – movies_released_20180801.csv. I have a set of folders under the parent directory – Movies as shown below –
Powershell_Input_Folder_Structure

I need to create a folder called ‘20180801’ under each of the sub-folders, if it doesn’t exist already. As each month passes, new sub-folders would need to be created.

Solution
My first step for the solution is to create a powershell script that can perform the sub-folder creation. Here is my script –

#Get Parent directory
$root = "C:\Users\kvankada\Documents\Personal\Learning\output\Movies"

#Get all directories present in the parent folder
$folders = Get-ChildItem -Path $root -Directory

#Iterate through each item within the list of parent folder
foreach($folder in $folders)
{
  #Get the subfolder full path
  $Subfolder = $folder.FullName + "\20180801"

   #Check to see if the folder exists
   if(! (Test-Path -Path $Subfolder))
   {
      #If it does not exist, create new sub-folder under each item
      New-Item -ItemType directory -Path $Subfolder
   }
}

The root folder of the above script file comes from the Alteryx workflow. The next steps now starts looking crazy but bear with me. It will all make sense. Drag in a ‘Input Task’ with following two columns –

Path – C:\Users\kvankada\Documents\Personal\Learning\output\create_subfolders.ps1
Input – C:\Users\kvankada\Documents\Personal\Learning\input\movies_released_20180801.csv

The Path should be a valid path and accessible to the user. It can be a UNC path as well. Drag in a ‘Formula’. Create two new columns Subfolder and Powershell_Script with following values as shown below –

Powershell_Formula_Task_powershell

As can be seen the new column Powershell_Script is nothing but the entirety of powershell script with ‘Subfolder’ being dynamically sent. Drag in a ‘Select’ task. De-select all the fields excepting ‘Path’ and ‘Powershell_Script’. Drag in a ‘Run Command’ task. Click on ‘Output’ under ‘Write Source [Optional]’ in the configuration window as shown below –
Powershell_Run_Command_Task_Write_Source.png
In the ‘Write to File or Database’ file, click on the drop-down and select ‘File’. A ‘Save a data file’ window pops up. Here chose any location and save the file as say ‘temp_powershell.csv’. I have chosen a ‘temp’ folder as my file location. In the ‘Alterx Designer’ the following changes still needs to be done against following settings –
Delimiters – \0
First Row Contains Field Names – Unchecked
Quote Output Fields – Never
Take File/Table Name From Field – Tick the checkbox. Chose the option as ‘Change Entire File Path’
Field Containing File Name or Part of File Name – Path
Keep Field in Output – Unchecked

The final dialog would look like as shown below. Changes have been highlighted. Click on ‘OK’ post completion –
Powershell_Run_Command_Task_Write_Source_2
The reason we need to do this is because Alteryx doesn’t provide a way to save the output to Powershell script file. We need to get the content of Powershell first saved as CSV and then the whole file being renamed via our ‘Path’ value that we have provided in the workflow.  It’s a roundabout way of doing things.

Continuing the ‘Run Command’ Task configuration and set them as shown below.
Pay particular attention to double quotes in the ‘Command Arguments [Optional]’ section. It is absolutely necessary.-
Powershell_Run_Command_Task_Final

That’s it. Now you are set to run. There is no need to configure the output of ‘Run Command’ task. If you do need to use, then ‘Read Results’ path needs to be configured. I might do a continuation post later on for that. For now you can ignore the need for additional output configuration. Here is how the final workflow looks like –

Powershell_Workflow_Overview_Final.png
I ran the workflow and here is my output –
Powershell_Final_Output
For final workflow in the ‘Run Command’ task, you would need to set ‘Run Minimized’ and ‘Run Silent’ check boxes to have smoother run. As you are testing out, it would be important for debugging purpose to leave them unchecked. In that manner you would see if you have any errors popping out.

That’s it. That’s how it is done in Alteryx. It took me quite a lot of tinkering and experimenting to get the desired result. Hope this post has helped in solving your own problems while getting Powershell scripts run via Alteryx.

 

Alteryx InboundNamedPipe::ReadFile: Not enough bytes read error – Postgres

Yet another strange day to be in. I kept getting the following error –

Error – The Designer x64 reported: InboundNamedPipe::ReadFile: Not enough bytes read. The pipe has been ended¶

I tried several methods – clearing of temp files, restarting alteryx, restarting my system multiple times, logging on and logging off etc to no avail. It just kept failing.

What added to the confusion is, it was happening only for few tables whose volumes are much smaller in size. I mean to say Table 1 with about ~20M records from the same database, we were able to extract successfully whereas from Table 2 with only ~3.5M we were facing this problem. Seemed really strange. The good old internet hadn’t turned up with anything useful and neither the Alteryx Forums. All the info that I got was that during one stage of workflow , it was running into error.

The case for me though is in my workflow apart from the ‘Input Tool’ from where I was pulling in the data and the ‘Output Tool’, there was nothing in between. Still it was failing.

Anyway I just took a break, thought for a while and said to myself – “When was the last time you had such strange error with Alteryx and Postgres combo?” Oh yeah, right here.

Hmm, why not try the same fix? I sure did and you know what! It just god darn works! Don’t even ask me how or why, it works. So gents and ladies, here is the fix you gotta do –

Go to the, Pre SQL Statement of your Alteryx ‘Input Tool’ and insert the following –

set client_encoding to ‘UTF-8’

As simple as that! Thank me later.