Incremental Data Load using Alteryx

As I keep progressing on my journey with Alteryx, I can't help thinking about all the ways in which it can be stretched and explored. Here are some of the ETL scenarios that I have in mind –

  1. Incremental data loads
  2. Loading files from folder for a certain date range and moving them to archive
  3. Error handling and logging
  4. Performance handling (loading ~1M records)
  5. Configurability and deployment across various environments
  6. Version control

These are what I can think of for now. Let me start off this post with the first of the above scenarios, i.e. incremental data loads. This is a quite common ETL scenario that every developer gets to work on over the course of their career.

Here is a sample staging table along with some sample data in it. This will be the source table –

create table stg_orders(
id int,
description varchar(10),
amount decimal(6,2),
created_date date
);

insert into stg_orders values (1,'Apples',20,'2017-12-15');
insert into stg_orders values (2,'Bananas',10,'2017-12-15');
insert into stg_orders values (3,'Mangoes',25,'2017-12-15');

Here is the destination table into which the data will be loaded.

create table orders(
id int,
description varchar(10),
amount decimal(6,2),
created_date date
);

Incremental loading refers to the process of loading only changed data from source to destination. How this 'changed' data is identified varies with the business requirements. Most often it is driven by a datetime field, which is the case here.

The first thing to ensure is that the destination has a controller table holding the last loaded datetime. Below is the controller table that I am using for this demo –


create table controller(
last_refreshed_date date
);

With the basic structure in place, let's begin to assemble the workflow. Here is how the final output looks –
1_Workflow
It's THAT simple. Just three little tools. That's the beauty of Alteryx. All the nitty-gritty, itsy-bitsy column details are completely hidden. Let's look at each one of them. The first one is an Input Tool, 'Get Max Last Refreshed Date', which gets the maximum last refreshed date from the existing data. Here is the query used for it. When the workflow is first run, the controller will not have any data in it, hence I have put in an arbitrary date of 12th December, 2017. Ideally this value would be the current date –

select coalesce(max(last_refreshed_date),'2017-12-12') as last_refreshed_date
from controller;

The second tool is a Dynamic Input tool, 'Get Orders data', where the filter gets added on; this requires a bit more explanation. Once you drag and drop the Dynamic Input tool, here are the steps to take, as shown below.

2_DynamicInput_SQLEditor

Click on the 'Edit' button. In the 'Connect to a File or Database' section, click through to the SQL Editor and put in the actual source query that fetches the data from the source. As can be seen, the date that I specified is an arbitrary one. Once you have put in the query, click 'OK' in the current and subsequent windows to close the wizard. Connect this tool to the previous one, i.e. the 'Get Max Last Refreshed Date' Input tool.
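For reference, the source query that goes into the SQL Editor does not need to be anything fancy. A minimal sketch, using the stg_orders table and the arbitrary 2017-12-12 date from above (the literal date is only a placeholder that the tool will later replace):

select id, description, amount, created_date
from stg_orders
where created_date > '2017-12-12';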
3_DynamicInput_ModifySQLQuery

At this point you are just connected to the source SQL query, and we now need to configure it so that the filter gets updated. Select 'Modify SQL Query', click on the 'Add' dropdown and choose 'SQL: Update WHERE Clause'. This will pop up a new window.

This is where, you could say, the 'magic' happens. As you see above, the first part is where you select the appropriate part of the WHERE clause; if your WHERE clause has multiple conditions, all of them appear in the drop-down. The wizard automatically reads the filter and identifies the field that needs to be replaced. The second part is where you define the replacement field, which in our case is 'last_refreshed_date'. Then click on 'OK'.

4_DynamicInput_SQLWhereClause
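To make the 'magic' concrete, here is roughly what the Dynamic Input tool ends up firing on a given run, assuming the source query sketched earlier: the placeholder literal in the WHERE clause is swapped with the incoming last_refreshed_date value.

select id, description, amount, created_date
from stg_orders
where created_date > '2017-12-15';  -- value supplied by 'Get Max Last Refreshed Date'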

Let's get to the last part, the 'Output Data' tool – 'Load data into Orders and update controller table'. Among its many options, this tool has a 'Post Create SQL Statement', which is what we are going to use. Here is the SQL statement that I have put in it –

BEGIN TRANSACTION;

INSERT INTO controller(
last_refreshed_date
)
SELECT MAX(created_date)
FROM    orders;

END TRANSACTION;

This is a feature that even Informatica has, but only as part of the source. Here in Alteryx, both the 'Input Data' and 'Output Data' tools have this option. Here is how the output properties look –
6_OutputData_Properties

With everything set, let's run the workflow. Here is the data from the orders table post-run –

7_Orders_Output
Here is the data in ‘Controller’ table –
8_Controller_Output_Run1

Let’s insert some more data in the source table –

insert into stg_orders values (4,'Apples',30,'2017-12-16');
insert into stg_orders values (5,'Bananas',2,'2017-12-17');

In this run only the 2 new records get picked up, which is what is expected. Here is the data from the orders table –
9_Orders_Output_Run2

That's about it. Here are a few points that I would like to call out –

1. In a typical ETL, the controller table would have a few more columns such as Id, Total Records etc., making it essentially an auditing table (a sketch follows this list).
2. The source SQL query may not be as simple as the one pasted in this demo; it can be a complex one, especially when loading data from an Operational Data Store (ODS) to a Data Warehouse (DW).
3. At no point in time did we need any variables to perform these operations.
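Purely to illustrate point 1, here is a rough sketch of what a fuller, audit-style controller table might look like (the column names are my own and not part of the demo):

create table controller_audit(
run_id int,
last_refreshed_date date,
total_records int,
load_started timestamp,
load_completed timestamp
);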


Alteryx – First and Lasting impressions

If not for the new job that I have joined, I would not have heard of this tool at all. Alteryx is another kid on the block that blends ETL, analysis and reporting into one. The reporting part hasn't impressed me much, but some of the ETL/analysis features just blew me away. In this post I would like to point out a few key areas where I feel it is miles ahead of SSIS (or) Informatica.

Installation
You just need to go to the site and click on Download Now. This prompts you to register (you must opt in to their subscriptions). Once done, you get a 14-day trial of the product. It's that simple.

Preview Data at every single stage post run
At each stage of the transformation you can see the data before and after, once the workflow has been executed. I just can't fathom how they are even doing this. Let me show you with an example.

Here is a simple workflow that takes student data as input, does a check on Gender (Male/Female) and then concatenates first name and last name on both branches.

1_StudentImport_Overview

If you are coming from an ETL background, you will quickly latch on to the transformations and what they do because they are so intuitive. Even an hour's video from YouTube would be sufficient to get up to speed on them. If you notice, each of the transformations above has green button-like icons before and after. They are basically inputs and outputs (as if you hadn't deciphered that already). Now let's say the workflow is run. Post-run, I can click on any of these buttons to see the data at that particular stage.

Let me pick the 'Identify Gender' transformation. Once you click on it, all the inputs and outputs of that transformation are available to preview. Shown below is what the input data looks like. My condition was to separate the rows by Gender –
2_IdentifyGender_Input
If I want to see what the ‘T’ output rows look like (i.e. Males), I just click on it –

2_IdentifyGender_True
Now let me have a look at the ‘F’ output rows –
3_IdentifyGender_False
Imagine this being the case at every stage of the workflow. It's just incredible to be able to see how your business rules are working at each step.
Let me repeat one more time, in case you haven't read the sub-heading: this preview is POST-RUN. I just can't think of any alternative for this in either SSIS or Informatica.

Testing the DFT without Output-
Just scroll back up a little and look at the screenshot of the workflow I posted. It doesn't contain an 'Output'. The last transformation you see is just a UNION ALL. If you are developing a POC or just testing some things out, you avoid the need to create a destination and dump the data into it. Of course, Trash Destination from the SSIS stack quickly comes to mind, but that's an add-on and not an out-of-the-box feature. I can't think of any equivalent in Informatica though.

Multicasting
Pretty much every transformation's output can be multicast and branched off to do some entirely new logic altogether.

Dynamic column propagation-
I have been saving the best for last. It has this incredibly advanced capability of bringing in dynamic columns just as if they had been there all along. Let's say in the data above, I make two changes to the input file –
-Added new column say Location at the end
-Added new column Is Married after ‘Last Name’ column

Without making any changes to the workflow, it just runs without throwing any error, and here is my output from the 'Union' transform –

4_ColumnAddition_WorkflowRun

HOLY COW! It's just mind-blowing, ain't it? In terms of data modifications it was pretty drastic, as in new columns were added not just at the end but also in between, and Alteryx just doesn't care. It just works!

I am sure I am just scraping the tip of the iceberg and there is a HUGE amount of exploration left to do. What is also fantastic about this product is the community behind it.

Their community forums and learning channels are free for anyone to ask and learn, much like the MSDN or Informatica communities. They run weekly challenges which are good fun if you want to flex your muscles and give it a try. The interface, though, is something I feel they can improve on; I feel a bit claustrophobic with the overbearing green colour theme and design, but you get used to it.

All in all, I am loving it. Watch out for future posts where I detail how it fares on performance, error handling, configurability, looping, dynamic data parsing etc.

Excel Import date getting recognized as month

I had a really weird issue this morning when I wanted to import some sample data from CSV into Excel. One of the columns in my data was a date column. Upon import, here is how the data got displayed when I clicked on the 'Filter' for the column. All the data I had in it was from October 2017 only, but the filter was showing something else –

Incorrect Date Filter

The date values in the column are in the format dd/mm/yyyy, but clearly Excel was reading them as mm/dd/yyyy. I knew right away I had to change some settings, but the number of changes turned out to be quite a few, so I thought it better to document them.

Click on Start and go to 'Date & time settings'. On the far right-hand side, under 'Related settings', click on 'Additional date, time, & regional settings' as shown below –
AdditionalDateTimeSettings
Under Region, click on ‘Change date, time, or number formats’ as shown below –

Change date,time or number formats

Here are two changes that you would need to do –
1. Under Formats tab, set ‘Format:’ to ‘English (Australia)’ as shown below –

RegionFormats.png
2. Under Location tab, set ‘Home location:’ to ‘Australia’ as shown below –

RegionLocation

This step necessitates restarting the system. In this example I have used 'Australia' as the locale as that's where I am currently residing. Do change it to the relevant locale.

That's it. Post-restart, the dates start showing up the way they are expected, as shown below –

FinalResult

SSIS vs Informatica – Error Handling

A close friend of mine, who like me had been working on MS-BI all through his career, has recently moved on to a different stream for performing ETL, i.e. Informatica.

Being relatively new to Informatica myself (only a couple of years and a few projects here and there), we have had many an argument on which is better and why. As I stated in my first post about Informatica, I didn't get a good feeling from it when starting out. There were too many IDEs to grapple with, the transformations do not look intuitive, simple tasks seem very hard to perform, etc. One argument he did make is that Informatica is the market leader and is a lot more powerful owing to its code re-usability, out-of-the-box connectivity across varied sources, error handling etc., and that SSIS is doing a lot of catching up.

I decided to dig deep and validate my own findings on the out-of-the-box connectivity and error handling. This is one such post, with more to come.

Task –

Load a million records from a source table to a destination table. Out of the million records, there are about 10 error records. The source table is a constant. There are two runs – the first with a table in an Oracle database as the destination, the second with a table in a SQL Server database as the destination.

I will use both SSIS and Informatica to accomplish this task and see how best both accomplish the task.

Setting up of Source –

I have Oracle 11g Express set up on the machine. My source table is a simple table consisting of Id and Name columns with about a million records in it. Id is basically an auto-incrementing value. After it's populated, I run a script to update 10 random records to NULL. The SQL code below performs this –


--Source table to hold the million records
create table SrcEmployeeData
(
 Id int
 ,Name varchar(50)
);

--Insert million records in it. The value level auto-increments
insert into SrcEmployeeData
select level, 'test' from dual connect by level <= 1000000;

--Update 10 random records with Name as NULL
update SrcEmployeeData
set Name = null
where Id in
 (
 select Id
 from (
 select *
 from SrcEmployeeData
 order by DBMS_RANDOM.VALUE
 )
 where ROWNUM <=10
 );

--Commit the data
commit;

Once done, confirm that there are 1 million records, with exactly 10 of them having NULL in the Name column.
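A quick sanity-check query for both counts at once (a sketch against the source table created above):

select count(*) as total_rows,
count(case when Name is null then 1 end) as null_name_rows
from SrcEmployeeData;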

Task 1 – Tool – Informatica, Destination – Oracle
Here is the definition of destination table in Oracle

create table HR.EmployeeData(
    Id int
   ,Name varchar(50) not null
);


I created a mapping called m_Load_EmployeeData_Oracle which contains just two  transformations – Source and Destination as shown below –

1_1M_OracleDestinationLoad_Mapping

I then developed a workflow for it, wf_m_Load_EmployeeData_Oracle, set the connections in the 'Config' section of the session as appropriate and started the workflow. Here is the result –

2_1M_OracleDestinationLoad_Error.png

So right off the bat, without any tweaking, the workflow executed successfully and correctly captured the 10 error records, with an execution time of 1m 33s.

Task 2 – Tool – Informatica, Destination – SQL Server

I created the destination table in SQL Server with the same table definition as in Task 1. I then created a new mapping, m_Load_EmployeeData_SQL, with the same layout, i.e. source connected to destination, as shown below –

3_1M_SQLDestinationLoad_Mapping

Created the workflow wf_Load_EmployeeData_SQL, set the appropriate 'Connections' in the session properties, ran the workflow, and here is the result –

3_1M_SQLDestinationLoad_Error

The 10 error records were captured without any ado, and in the same time. Absolutely no difference.

Task 3 – Tool – SSIS, Destination – SQL Server

The source still remains the same. Here comes the real kicker, even before we get into the comparison. If we set up the DFT with a simple 'OLE DB Source' and 'OLE DB Destination' and start the package, it is bound to fail as shown below –

5_1M_SQLDestinationLoad_Package_Error.png

Out-of-the-box error capturing just doesn't exist. Here is what I have done to capture the error records, using the cascading three-destination technique. This is how it works. Create three copies of your 'OLE DB Destination'. Connect your source to the first destination. Set its 'Rows per batch' and 'Maximum insert commit size' properties to 50000 as shown below –

7_1M_SQLDestinationLoad_Package_OLEDST_Setting

Under Mappings, the columns get auto-mapped by name. Connect the 'Error Output' of the first destination to the second copy and set the same two values for it to 10000. Connect the second copy's error redirection to the third copy, and for that final one set the two values to 1. Rename the destinations appropriately.

Now drag in one more OLE DB Destination and plug the error output of the third copy into this new one. This time we are going to point it at a new table created for capturing the errors, say 'dbo.EmployeeData_Error'. The table definition is listed below –


CREATE TABLE [dbo].[EmployeeData_Error](
[ID] [numeric](38, 0) NULL,
[NAME] [nvarchar](50) NULL,
[ErrorCode] [int] NULL,
[ErrorColumn] [int] NULL,
[DateInserted] [datetime] NULL
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[EmployeeData_Error] ADD CONSTRAINT [DF_EmployeeData_Error_DateInserted] DEFAULT (getdate()) FOR [DateInserted]
GO

At the mapping level, ensure the ErrorCode and ErrorColumn outputs of the third destination copy (the one where both values are set to 1) are connected to the same columns of the EmployeeData_Error table, as shown below –

10_1M_SQLDestinationLoad_Package_OLEDST_ErrorCapture.png

With this set up, the package executes with no problem whatsoever, as shown below –

11_1M_SQLDestinationLoad_Package_Execution
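A quick way to confirm the split of good rows vs. error rows on the SQL Server side (a sketch, assuming a clean run into the dbo.EmployeeData and dbo.EmployeeData_Error tables used above):

--expect 999,990 rows, i.e. 1,000,000 minus the 10 error rows
select count(*) as loaded_rows from dbo.EmployeeData;

--expect the 10 rows that had NULL names, along with their error metadata
select count(*) as error_rows from dbo.EmployeeData_Error;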

Looks hale and hearty. How about the speed of execution –

11_1M_SQLDestinationLoad_Package_Execution_Speed.png

It completed in just 14 seconds, the fastest of the tasks we have performed so far. Now let's go to the final task.

Task 4 – Tool – SSIS, Destination – Oracle

Before we get into the package set-up, one needs to make sure the Oracle Provider for OLE DB is present. If not, get the latest stable version from the following link. The package set-up is very simple this time around. In total there are three components – an OLE DB Source for fetching the data and 2 OLE DB Destinations (1 for the destination table and the other for capturing the errors).

The OLE DB Destination for the target data needs its 'Data access mode' set to 'Table or View', and one cannot use the 'Fast Load' option. The image below shows the required set-up –

12_1M_OracleDestinationLoad_Package_OLEDTSSetting.png

With that set, here is the snapshot of the package run –

16_1M_OracleDestination_Package_Execution

And the execution speed –

16_1M_OracleDestination_Package_ExecutionResults

Almost an hour for just a million records. One whole hour for a dataset containing only two columns. I once had a job transferring about 2 million records from a much larger table, with a similar requirement to capture error records; that package ran for almost 3 hours. With anything other than SQL Server as the destination, SSIS performs horribly.

Conclusion-

SSIS offers very poor performance for data loads where the destination is anything other than SQL Server. If you want more robustness, you need to look outside for a third-party tool. Informatica, on the other hand, just doesn't care what the source and destination are; it works as expected, giving uniform performance.


Visual Studio – “Build must be stopped before the solution can be closed” error

Time and again I have been getting this nasty error for no obvious reason as seen below –

BuildMustBeStopped

The first thing I tried was to stop the build by going to Build -> 'Cancel Build'. That just doesn't work. So here is the PowerShell I run to close my session –

$p = get-process devenv   # grab the Visual Studio (devenv) process
Stop-Process $p -WhatIf   # dry run to confirm this is the process I intend to stop
Stop-Process $p           # actually stop it

I first get the devenv process into a variable, check that what I am about to close is what I intend (the -WhatIf run), and then proceed to stop it. That's all there is to it. You then open a new instance of Visual Studio and get cracking on whatever it is that you were working on.

SSRS Report Migration 2012+ – Method 1

As Microsoft goes on a rampage shortening their release cycle and shipping not one but three versions in the last six years (2012, 2014, 2016!), it is time for us to look at the various methods available to migrate reports from one Report Server to another. The reason I bring this up is that for all my working career, the only tool I had used to migrate reports is RSScripter.exe. The original link is archived and the tool is available from the following link – RS Scripter. It works well up to SQL Server 2008 R2. What about the versions after that?

That's the question I want to answer through this post. In my quest to find out how to build a package ready for deployment, and to further automate it, one of the first solutions I got my hands on is this – SSRS Powershell Deploy.

It's a bunch of PowerShell scripts that can be used to deploy reports. The biggest downside to this method, though, is that you can't pick and choose the reports to deploy: you either deploy all of them or none at all. If the solution is analyzed well enough, we may come up with a workaround for that. For now, let's dive into how to set the solution up and the steps involved in deploying –

Once you open the link, click on the 'SSRS-1.3.0.zip' file and follow the steps below –

1. Download the .zip from https://github.com/timabell/ssrs-powershell-deploy/releases/latest
2. Right-click the zip file in windows explorer, click “properties”, and then click “Unblock”.
3. Create folder ‘Documents\WindowsPowerShell\Modules\’
4. Open up the zip file, copy the SSRS folder, paste it into
`Documents\WindowsPowerShell\Modules\`. (Or somewhere on your
`$env:PSModulePath`)

You can test whether the module is available by running a simple PowerShell command as shown below –

1_SSRSPowershellImport.png

As the names imply, if you want to deploy an SSRS project (i.e. a .rptproj file), use Publish-SSRSProject; if you want to deploy an SSRS solution (i.e. a .sln file), use Publish-SSRSSolution.

Using this, let's do a sample report deployment. I will first show the simplest way to deploy, i.e. by using pre-filled configuration data. I have a project OpenDataReports containing one report, 'Top 10 Products.rdl'.

Once the solution is opened, go to Project -> <<ProjectName>> Properties. The properties page opens up.
Perform the following changes as shown below –
Configuration – Set it to Release
TargetServerURL – http://<<MachineName>>/ReportServer<<_InstanceName>>
In addition, the other settings such as TargetDatasetFolder, TargetDataSourceFolder etc. can be set as well.

2_SSRSProjectProperties.png

With the properties set, right-click on the project and click on 'Rebuild' as shown below. Ensure it succeeds –

3_SSRSProjectRebuild.png

Go to the project path, then the \bin\Release folder, to ensure all the required reports and data sources are present. In my case, here are the contents of the folder –

4_SSRSProjectbinReleaseContent.png

Finally, run the PowerShell command. The command itself is a simple one –

Publish-SSRSProject -Path "<<FilePath>>\OpenDataReports.rptproj" -Configuration Release -Verbose

5_SSRSProjectDeploy.png

That's the simple way to go about deploying reports. Of course, one need not do the pre-config settings; if you look at all the parameters provided by the script, each of the project properties can be set at run time. I will be posting an example of that in the next post.

Meanwhile, do leave a comment on how deployments are happening in your environments.


Bye Bye SQL Profiler – Welcome Extended Events

SQL Profiler has long been THE go-to tool for tracing queries, and if you are a SQL developer it would be a miracle if you haven't used it at all. It comes into play in every facet of the BI stack, be it an SSIS package that is currently running, a blank SSRS report that needs understanding, or a cube being processed in the background; one can hook up a trace as the first line of debugging.

The biggest problem with it, though, is that it ALWAYS has to be used within a limited window. Leave it running longer than intended and all activity on the SQL Server tends to slow down, as tracing is a resource-intensive operation.

Sensing this, I believe, Microsoft first came out with Extended Events in the SQL Server 2008 version. It was horrendous, to say the least. At that time everything had to be done through a bunch of scripts, joining multiple tables, with XQuery on top to grab the actual data you needed. The learning curve was huge. I admit I had read various tutorials and done some practice, but when I really wanted to use it I used to get cold feet. Without googling at least twice it was a no-go, and I used to fall back on SQL Profiler.

From SQL Server 2012 onward, Microsoft introduced a GUI for Extended Events, making it a real breeze to work with. It is now the de-facto tool I use for tracing queries.

Here is how one can go about setting one up and using it to trace your SQL queries –

Log on to SQL Server and go to Management -> Extended Events -> Sessions. Right-click and choose 'New Session Wizard' as shown below –

1_Opening_NewSessionWizard.png

Click 'Next >' on the Introduction screen. In the 'Set Session Properties' tab give the session a name, say 'All SQL Queries', as shown below, and click on 'Next >'.

2_SetSessionProperties.png

In ‘Choose Templates’ tab click on ‘Use this event session template:’ and in the drop down select ‘Query Batch Tracking’ as shown below and then click on ‘Next >’ –

3_ChooseTemplate.png

In the 'Select Events to Capture' tab, under 'Selected events:' you can remove error_reported and rpc_completed and keep only sql_batch_completed, as shown below, then click on 'Next >'. You can remove the two that come by default by clicking on them and using the '<' arrow button –

4_SelectEventsToCapture

Keep the defaults for the subsequent screens and click on 'Finish' on the 'Summary' page. On the 'Create Event Session' page that pops up you are given two options to start seeing results immediately; enable them and click on 'Close' as shown below –

5_CreateEventSession.png

With that done, the trace is up and running, waiting for queries to come in. Here is the output showing one query that was run against a database on the server –

6_QueryResult

As can be seen, this output is cleaner and easier to read, without the redundant information that used to come along with SQL Profiler.

This is just the tip of the iceberg of what can be achieved with Extended Events. There is an ocean out there to explore, but for anyone looking to trace a query this should be a good start.
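For the curious, the wizard is only generating T-SQL under the covers. A rough, illustrative sketch of a hand-written equivalent of a session like this (not the exact script the wizard produces) looks like this:

CREATE EVENT SESSION [All SQL Queries] ON SERVER
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name))
ADD TARGET package0.ring_buffer
WITH (STARTUP_STATE = OFF);
GO

ALTER EVENT SESSION [All SQL Queries] ON SERVER STATE = START;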


SSIS Lookup Gotcha – Test Fully

The task I had on my hands was simple: perform a lookup on a target table with my source data and get all the non-matched rows. The data I was fetching for the lookup from the target had some additional string manipulation applied to derive the lookup value. I wrote a query to fetch that data, tested it on a sample subset (using TOP 10) to check its correctness, and went ahead and implemented it.

I set everything up, and when I ran the package the lookup was just not working. Rows that I expected to match were simply getting redirected to no-match. When I started debugging, it became apparent that the query I had used to retrieve the dataset for the lookup was incorrect. This time I ran the query without the subset, and it failed to execute, showing me the actual error.

The moral of the story is that one should not rely on the 'Preview' results offered by the Lookup transform, or on a subset when sampling the data. Check that the query works for the entire set.

Let me illustrate this with an example.

Below is a sample of Source Data –


FileName
Test1.txt
Test3.txt

Here is the data that I am doing the lookup on, from a table, say dbo.LookupTable (a simple Id and FilePath structure) –

1_TargetDataSample
The string manipulation query for this, as you might have already guessed, gets the FileName from the FilePath –

select Id,RIGHT(FilePath, CHARINDEX('\', REVERSE(FilePath)) -1) as 'FileName'
from dbo.LookupTable;

I have a simple Data Flow Task which does the following –

  1. OLE_SRC – Get Source Data – gets the source data as shown above.
  2. LKP – Get Id – using the query above, fetches the Id and FileName. Note that the moment you put that query in and go to 'Columns', it throws up an error. For the purpose of this illustration, just ignore it and do the mapping.
  3. Trash Destination – a priceless open-sourced transformation from Konesans and a must-have development aid.

Here is how the package looks after running it –

2_DataFlowTaskContents

As can be seen only 1 row gets shown as matched even though there are 2 matching rows.

The data in this illustration is very small, but the original data I had was about 100k records, and it was difficult to debug why this was happening.
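For what it's worth, here is a more defensive sketch of the lookup query, assuming SQL Server and assuming the full-set failure came from FilePath values without a backslash (in which case CHARINDEX returns 0 and RIGHT is handed a negative length):

select Id,
case
when CHARINDEX('\', FilePath) = 0 then FilePath --no backslash, treat the whole value as the file name
else RIGHT(FilePath, CHARINDEX('\', REVERSE(FilePath)) - 1)
end as 'FileName'
from dbo.LookupTable;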


Informatica 101 – Where are my Foreach loops? – 1

One of the components I use most in SSIS is the Foreach Loop, which *gasp* just isn't there in Informatica out-of-the-box. In this post, let's look at how the same thing is done there.

Test Data


I have three files named StudentsData001.txt, StudentsData002.txt, StudentsData003.txt. Each file has two columns, Id and Name, such as

Id,Name
1,Karthik
2,Pratap

Our goal is to load the data from all three files into a table Students, which has two columns, Id and Name.

Steps to be done


Open the PowerCenter Designer. Go to 'Sources' and click on 'Import from File' as shown below –

1_SourceImportFlatFile

Choose one of the files from the source folder and, in the next window, make sure to check 'Import field names from first line' –

2_FlatFileImportWizard

Accept the defaults and complete the wizard, setting an appropriate size for the 'Name' column. Once the source is set up, we need to set up the destination. Go to 'Tools' -> 'Target Designer' and then 'Targets' -> 'Import from Database' as shown below, and connect to the 'Students' table into which you want all the data imported –
3_CreateDestination

Go to 'Tools' -> 'Mapping Designer' and then 'Mappings' -> 'Create'. Drag and drop both the source and the destination onto the mapping designer, so that the source data can be connected to the destination; shown below is the 'Source' –
5_Mapping_DragAndDrop

Connect both columns from the 'Source Qualifier' to the destination by dragging and dropping the columns.

6_Mapping_JoiningDestination.png

Once it is set, generate the workflow by going to ‘Mappings’ -> ‘Generate Workflow’. Just follow the wizard and click on ‘Next’ till it finishes.

7_WorkflowGeneration.png

What we have done so far is the basic set-up, just like in SSIS when we build a DFT connecting a Flat File Source to an OLE DB Destination (or whichever destination). The actual work of looping is done at the session level. Before we get to that, remember that Informatica has two modes of handling flat files, Direct and Indirect. We need the Indirect mode to handle a list of multiple files.

First, we need to create a file containing the list of all the file names. I created a file called ImportFileList.txt as shown below –

10_IndirectFileList.png
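In case the screenshot is hard to make out, the list file is nothing more than one source file entry per line. Assuming the three sample files sit in the source file directory configured below, it simply contains:

StudentsData001.txt
StudentsData002.txt
StudentsData003.txt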

Open the PowerCenter Workflow Manager. Open the workflow wf_m_FF_To_Database and double-click on the session s_m_FF_To_Database. The 'Edit Tasks' window opens up as shown below –

11_SessionSQAttributeSetting_0.png

Go to the Mappings tab and under 'Sources' click on 'SQ_StudentsData001'. In the bottom-most window set the following properties –

  • Source filetype – Indirect
  • Source file directory – path of the directory where ImportFileList.txt is present
  • Source filename – ImportFileList.txt

11_SessionSQAttributeSetting.png

That's about it; those are the basic changes to be done. Save the workflow after making them. Also ensure that, under the Connections tab, the 'Students – DB Connection' connection name is valid.

The next step is to start the workflow by going to Workflows -> Start Workflow. Once it is successful you will see that all the files have been processed and the ImportFileList.txt is no longer there. The post-processing results are shown below –

12_PostWorkflowResults.png

Next Steps –

  • How do we go about automating the generation of this file list? I am guessing we would need a command task in the workflow that reads the file names and generates the list file.
  • How does it change from DEV to, say, UAT, i.e. how do we localize it?


Import and Export Wizard – Handling Dates in Flat Files

As they say, there is a first time for everything. Having worked on so many packages all through my career, not once did I need to import a flat file containing dates directly through the SQL Server Import and Export Wizard. It may also be that I never set the data type as 'Date' for such columns, and rather took them in as varchar values when doing imports.

So today, as part of a task, I had a requirement to import data with some date columns in it. Let's say the data for my columns looks like this, in a file called DateTest.txt. Note that there is a row with blank data –


StartDate,EndDate
2017-01-01,2017-01-01

2017-05-01,2017-05-01

So I opened up the SQL Server Import and Export Wizard, set the Data Source to 'Flat File Source', and browsed to the file as shown below –
1_SQLServerImport_General.png

Now go to Advanced and set the data type for both the StartDate and EndDate columns to 'DT_DBDATE', which translates to the date datatype of SQL Server. For more info refer to – link. The screenshot below is for 'EndDate'; do the same for the StartDate column as well.
2_SQLServerImport_Advanced_SetDataType.png

Set the destination to your local database, RnD in my case, as shown below –
3_SQLServerImport_Advanced_SetDestination.png

In 'Select Source Tables and Views', leave everything as is and click on 'Next >' (this will create the table by default). Leave the defaults on the 'Save and Run Package' screen, click on 'Next >', and then click on Finish on the last page of the wizard. You will see 'Operation stopped', with 'Copying to [dbo].[DateTest]' in an error state as shown below.

4_SQLServerImport_EndOfWizard_OperationStopped.png

If you dig further into the Messages, here is the error it throws up –

5_SQLServerImport_EndOfWizard_OperationStopped_ErrorMessage.png

The error states –

An OLE DB record is available.  Source: “Microsoft OLE DB Provider for SQL Server”  Hresult: 0x80004005  Description: “Conversion failed when converting date and/or time from character string.”.
 (SQL Server Import and Export Wizard)

The blank values are treated as character strings that cannot be converted to dates, which is what the error states.

Solution 1 –

Instead of setting the data type as 'DT_DBDATE', set it as 'DT_DATE' and the import will pass. There are two side-effects to this –

  • The destination column would be of type datetime instead of date.
  • All the blank values get set to '1899-12-30 00:00:00.000', as can be seen below –
6_SQLServerImport_Solution1_Result.png

It's not an optimal solution. If you are importing just one file with a limited number of such columns, and the destination table you are importing into isn't large, then we can go with this approach. Depending on the use case, one can then either update the values to NULL or leave them as is.
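If you do go with this approach and later want the placeholder dates back as NULLs, a small clean-up along these lines would do it (a sketch against the dbo.DateTest table that the wizard creates):

update dbo.DateTest set StartDate = null where StartDate = '1899-12-30';
update dbo.DateTest set EndDate = null where EndDate = '1899-12-30';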

Solution 2 –

This involves creating a package for the same operations. One can go about it in the traditional way, i.e. open SQL Server Data Tools (Visual Studio), add a new package, drag and drop a DFT, yada yada.

Instead, let's replicate the same behaviour as before. How, you may ask? Did you know that one can fire up the 'SQL Server Import Wizard' from SQL Server Data Tools itself? Before we go further, if you have been following along, the table dbo.DateTest should already exist in your destination.

Open an SSIS project and go to 'PROJECT' -> 'SSIS Import and Export Wizard…' as shown below –

7_SQLServerImport_Solution2_VisualStudioProject.png

The wizard looks exactly the same as when fired from SQL Server; the difference shows up at the end. Follow the same steps as earlier, i.e. set the data type to 'DT_DBDATE'. Instead of executing, the wizard creates a package, and the final window looks like this. It will create a new package called Package1.dtsx if there isn't one already; if there is, it would create Package2.dtsx –

8_SQLServerImport_Solution2_VisualStudioProject_EndResult.png

At this point, if you run the automatically generated package as is, you will get the same error. Here are the changes to be made.

Open the Control Flow task 'Data Flow Task 1'. In the data flow, open the 'Source – DateTest_txt' component and ensure 'Retain null values from the source as null values in the data flow' is checked, as shown below –

8_SQLServerImport_Solution2_VisualStudioProject_RetainNull.png

Next, double-click on the 'SourceConnectionFlatFile' connection manager, go to Advanced and change the data types of StartDate and EndDate to DT_STR with a length of 10. The image below shows EndDate; do the same for StartDate.

20_ImportExport_AdvancedConnectionManager

Since the source connection manager has changed, the 'Source – DateTest_txt' component needs a refresh. Double-click on it and you will be presented with the changes shown below. Accept them.
21_ImportExport_RefreshFlatFileConnectionManager

Delete the connector between the Flat File source and the OLE DB Destination and drag a Derived Column transform in between them. Add the following two expressions as shown below –

  • Derived Column Name – DC_StartDate ; Expression – StartDate == "" ? NULL(DT_DBDATE) : (DT_DBDATE)StartDate
  • Derived Column Name – DC_EndDate ; Expression – EndDate == "" ? NULL(DT_DBDATE) : (DT_DBDATE)EndDate


17_SQLServerImport_Solution2_DerivedColumnTransform.png

Connect the Derived Column output to the OLE DB Destination and map the newly derived columns.
24_ImportExport_OLEDBDestinationMapping

In addition to that set the ‘Keep Nulls’ property to yes.

23_ImportExport_OLEDBDestination

That's it. Now execute the package. All three records now get loaded successfully –

25_ImportExport_ExecutionResult

The data gets transferred with the blank values retained as NULLs, as shown below –
26_ImportExport_ExecutionResultTable

To summarize the solution: out of the box, the Import/Export Wizard will not load NULL values into date columns. Here are the changes to be made –

  • Set all the date columns for which you want blanks retained as NULL to string values at the source.
  • Add a Derived Column transformation to convert the data to DT_DBDATE.
  • Set the retain-nulls option at both the source and the destination to yes.

Solution 3 –

The ideal solution would be one where you can use DT_DBDATE at the source itself and have it flow through to the destination. For some reason I have been getting strange errors while doing that, as shown below –

Error:Year, Month, and Day parameters describe an un-representable DateTime.

I am still working on it. Once I get a better solution, I will post it here.