Efficient Schema Extraction in Databricks with Pyspark: A Step-by-Step Guide

April 4, 2023 | June 1, 2023 | Karthik V

Extracting schema information from Databricks seems like it should be very simple, doesn’t it? After all, there is INFORMATION_SCHEMA to use. Unfortunately, that is available only with Unity Catalog, i.e. a Databricks metastore that can share data across multiple Azure Databricks workspaces, and no such catalog exists in our project. The best recourse is to use the PySpark SQL libraries.

Problem

In a recent task, I had to update a large SQL query spanning 10+ Databricks tables, adding some attributes and replacing some existing ones. The only problem was that most of the attributes had no table aliases, so there was no way to know which attribute came from which table. It’s one of the most irritating things, bothers me to no end, and feels like…
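To make the problem concrete, here is a minimal sketch of one way to find out which table each unaliased attribute belongs to. It is not necessarily the approach the full post takes. It assumes `spark` is the SparkSession that a Databricks notebook provides, and the helper names `collect_columns` and `index_by_column` and the table names in the usage note are hypothetical.

```python
from collections import defaultdict


def collect_columns(spark, table_names):
    """Return {table_name: [(column_name, data_type), ...]}.

    `spark` is assumed to be an active SparkSession (predefined in
    Databricks notebooks). The schema is read from the table metadata,
    so no rows are scanned.
    """
    return {
        name: [
            (field.name, field.dataType.simpleString())
            for field in spark.table(name).schema.fields
        ]
        for name in table_names
    }


def index_by_column(table_columns):
    """Invert the mapping: {column_name: [tables that contain it]}.

    Column names are lower-cased because Spark SQL resolves them
    case-insensitively by default.
    """
    owners = defaultdict(list)
    for table, columns in table_columns.items():
        for column, _data_type in columns:
            owners[column.lower()].append(table)
    return dict(owners)
```

In a notebook this could be called as `index_by_column(collect_columns(spark, ["db.table_a", "db.table_b"]))`. Any attribute that maps to more than one table is ambiguous in the query and needs an explicit alias.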

Solution

Dynamically read Zip file contents using Alteryx

September 7, 2020 | April 9, 2021 | Karthik V

I had an interesting business problem to solve and wanted to share how it can be solved.

Business Problem

On a daily basis, a zip file containing various flat files is dropped at a file location. The contents of the flat files are to be read and extracted. All the files have the same metadata.

File names within the zip file are dynamic.
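For readers who want to see the requirement in code, the sketch below uses Python's standard library rather than Alteryx. It is only an illustration of the problem statement and not the workflow this post describes. It assumes the flat files are UTF-8, comma-delimited and share one header row; the function name `read_zip_rows` is hypothetical.

```python
import csv
import io
import zipfile


def read_zip_rows(zip_source, delimiter=","):
    """Yield (member_name, row_dict) for every flat file in the archive.

    Member names are discovered at run time with namelist(), so nothing
    depends on what the files are called on a given day.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as raw:
                # ZipFile.open returns a binary stream; wrap it for csv.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text, delimiter=delimiter):
                    yield member, row
```

Because every file has the same metadata, the rows from all members can be appended to a single output, with the member name kept alongside each row for traceability.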

Solution

Alteryx provides an out-of-the-box Input tool for working with zip files. All one needs to do is drag and drop the zip file onto the canvas, and the tool itself will pop up a prompt asking which files are to be extracted, as shown below.

Karthik's BI Musings

Category: Python
