SSIS vs Informatica – Error Handling

A close friend of mine, who like me had worked on MS-BI all through his career, has recently moved on to a different ETL stream: Informatica.

Since he is relatively new to Informatica (only a couple of years and a few projects here and there), we had many an argument over which tool is better and why. As I stated in my first post about Informatica, I didn't get a good feel for it when starting out. There were too many IDEs to grapple with, the transformations did not look intuitive, simple tasks seemed very hard to perform, and so on. One of the arguments he made is that Informatica is the market leader and is a lot more powerful owing to its code re-usability, out-of-the-box connectivity across varied sources, error handling etc., and that SSIS is now doing a lot of catching up.

I decided to dig deep and validate my own findings on the out-of-the-box connectivity and error handling. This is the first of such posts, with more to come.

Task –

Load a million records from a source table to a destination table. Out of the million records, about 10 are error records. The source table remains constant. There are two runs – the first with a table in an Oracle database as the destination, the second with a table in a SQL Server database as the destination.

I will use both SSIS and Informatica to accomplish this task and see how well each of them handles it.

Setting up the Source –

I have Oracle 11g Express set up on the machine. My source table is a simple table consisting of Id and Name columns, with about a million records in it. Id is basically an auto-incrementing column. After the table is populated, I run a script to update 10 random records so that Name is NULL. The SQL code below performs this –


--Source table to hold the million records
create table SrcEmployeeData
(
 Id int
 ,Name varchar(50)
);

--Insert a million records; CONNECT BY LEVEL generates the values 1 to 1,000,000
insert into SrcEmployeeData
select level, 'test' from dual connect by level <= 1000000;

--Update 10 random records with Name as NULL
update SrcEmployeeData
set Name = null
where Id in
 (
 select Id
 from (
 select *
 from SrcEmployeeData
 order by DBMS_RANDOM.VALUE
 )
 where ROWNUM <=10
 );

--Commit the data
commit;

Once done, confirm that there are 1 million records, with 10 of them having NULL in the Name column.
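A quick sanity check against the source table created above –

select count(*) as total_rows
 ,sum(case when Name is null then 1 else 0 end) as null_names
from SrcEmployeeData;

This should return 1,000,000 total rows and 10 rows with a NULL Name.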

Task 1 – Tool – Informatica, Destination – Oracle
Here is the definition of the destination table in Oracle –

create table HR.EmployeeData(
    Id int
   ,Name varchar(50) not null
);


I created a mapping called m_Load_EmployeeData_Oracle which contains just two objects – the source and the destination – as shown below –

1_1M_OracleDestinationLoad_Mapping

I then developed a workflow for it, wf_m_Load_EmployeeData_Oracle, set the connections in the 'Config' section of the session as appropriate, and started the workflow. Here is the result –

2_1M_OracleDestinationLoad_Error.png

So right off the bat, without any tweaking, the workflow executed successfully and correctly captured the 10 error records, with an execution time of 1m 33s.
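If you want to verify the load from the database side as well, a simple count against the Oracle target (assuming the 10 rejected rows were simply skipped) should show 999,990 rows –

--Expect 999,990 rows: 1,000,000 source rows minus the 10 error rows
select count(*) from HR.EmployeeData;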

Task 2 – Tool – Informatica, Destination – SQL Server

I created the destination table in SQL Server with the same table definition as in Task 1. I then created a new mapping, m_Load_EmployeeData_SQL, with the same design i.e. source connected to destination, as shown below –

3_1M_SQLDestinationLoad_Mapping

Created the workflow wf_Load_EmployeeData_SQL, followed by setting the appropriate 'Connections' in the session properties. Ran the workflow and here is the result –

3_1M_SQLDestinationLoad_Error

The 10 error records were captured without any ado, and in exactly the same time – absolutely no difference.

Task 3 – Tool – SSIS, Destination – SQL Server

The source still remains the same. Here comes the real kicker, even before we get into the comparison. If we set up the Data Flow Task (DFT) with simple 'OLE DB Source' and 'OLE DB Destination' components and start the package, it is bound to fail as shown below –

5_1M_SQLDestinationLoad_Package_Error.png

Out-of-the-box error capturing just doesn't exist. Here is what I did to capture the error records using the cascading three-destination technique. This is how it works: create three copies of your 'OLE DB Destination'. Connect your source to the first one and set its 'Rows per batch' and 'Maximum insert commit size' properties to 50000 as shown below –

7_1M_SQLDestinationLoad_Package_OLEDST_Setting

Under mappings, the columns are auto-mapped by name. Connect the 'Error Output' of the first destination to the second copy and set the same two values on it to 10000. Connect the second destination's error redirection to the third copy and, for that final one, set the two values to 1. Rename the tasks appropriately. The idea is that when a batch fails, the whole batch gets redirected, so each successive destination retries the failed rows in ever smaller batches until, at a batch size of 1, only the genuinely bad rows are left on the error path.

Now drag in another OLE DB Destination and plug the error output of the third copy into this new one. This time we are going to point it at a new destination table for capturing the errors, say dbo.EmployeeData_Error. The table definition is listed below –


CREATE TABLE [dbo].[EmployeeData_Error](
    [ID] [numeric](38, 0) NULL,
    [NAME] [nvarchar](50) NULL,
    [ErrorCode] [int] NULL,
    [ErrorColumn] [int] NULL,
    [DateInserted] [datetime] NULL
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[EmployeeData_Error] ADD CONSTRAINT [DF_EmployeeData_Error_DateInserted] DEFAULT (getdate()) FOR [DateInserted]
GO

At the mapping level, ensure that the ErrorCode and ErrorColumn columns from the third destination copy (the one where both values are set to 1) are connected to the corresponding columns of the EmployeeData_Error table, as shown below –

10_1M_SQLDestinationLoad_Package_OLEDST_ErrorCapture.png

With this set up, the package executes with no problem whatsoever, as shown below –

11_1M_SQLDestinationLoad_Package_Execution

Looks hale and hearty. How about the speed of execution –

11_1M_SQLDestinationLoad_Package_Execution_Speed.png

It completed in just 14 seconds – the fastest of all the runs so far.
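Before moving on, a quick sanity check (assuming the SQL Server destination table is named dbo.EmployeeData, mirroring the Oracle definition) –

--Expect 999,990 rows loaded and 10 rows captured as errors
select count(*) as loaded_rows from dbo.EmployeeData;
select count(*) as error_rows from dbo.EmployeeData_Error;

With that confirmed, let's go to the final task.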

Task 4 – Tool – SSIS, Destination – Oracle

Before we get into the package set-up, one needs to make sure the Oracle Provider for OLE DB is present; if not, get the latest stable version from the following link. The package set-up is very simple this time around. In total there are three components – an OLE DB Source for fetching the data and two OLE DB Destinations (one for the destination table and the other for capturing the errors).

The OLE DB Destination for the target data needs its 'Data access mode' set to 'Table or view' – the 'Fast Load' option cannot be used. The image below shows the required set-up –

12_1M_OracleDestinationLoad_Package_OLEDTSSetting.png

With that set, here is the snapshot of the package run –

16_1M_OracleDestination_Package_Execution

And the execution speed –

16_1M_OracleDestination_Package_ExecutionResults

Almost an hour for just a million records. One whole hour for a dataset containing only two columns. I had a job wherein I was transferring about 2 million records from a much larger table, and we had a similar requirement to capture the error records. That package ran for almost 3 hours. With anything other than SQL Server as the destination, SSIS performs horribly.

Conclusion –

SSIS offers very poor performance for data loads where the destination is anything other than SQL Server. If you want more robustness, you need to look at third-party components. Informatica, on the other hand, just doesn't care what the source and destination are – it works as expected, giving uniform performance.


Informatica 101 – Where are my Foreach loops? – 1

One of the things I use most in SSIS is the Foreach Loop container, which *gasp* just isn't available out-of-the-box in Informatica. In this post, let's look at how it's done.

Test Data


I have three files named StudentsData001.txt, StudentsData002.txt and StudentsData003.txt. Each file has two columns – Id and Name – for example

Id,Name
1,Karthik
2,Pratap

Our goal is to integrate the data from all three files into a table Students, which has two columns: Id and Name.
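For reference, a minimal definition of the Students table could look like this (assuming Id and Name simply mirror the columns in the source files) –

create table Students
(
 Id int
 ,Name varchar(50)
);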

Steps to be done


Open the PowerCenter Designer. Go to 'Sources' and click on 'Import from File' as shown below –

1_SourceImportFlatFile

Choose one of the files from the source folder and, in the next window, ensure that 'Import field names from first line' is checked –

2_FlatFileImportWizard

Accept the defaults and complete the wizard, setting an appropriate size for the 'Name' column. Once the source is set up, we need to set up the destination. Go to 'Tools' -> 'Target Designer', then 'Targets' -> 'Import from Database' as shown below, and connect to the 'Students' table into which all the data is to be imported –
3_CreateDestination

Go to ‘Tools’ -> ‘Mapping Designer’ and then go to – ‘Mappings’->’Create’. Drag and drop both the source and destination on to the mapping designer and then connect the source data on to the destination as shown below for ‘Source’ –
5_Mapping_DragAndDrop

Connect both columns from the 'Source Qualifier' to the destination by dragging and dropping them.

6_Mapping_JoiningDestination.png

Once it is set, generate the workflow by going to ‘Mappings’ -> ‘Generate Workflow’. Just follow the wizard and click on ‘Next’ till it finishes.

7_WorkflowGeneration.png

What we have done so far is the basic set-up, just like an SSIS DFT containing a Flat File Source connected to an OLE DB Destination (or whichever destination). The actual work of looping is done at the session level. Before we get to that, remember that Informatica has two modes of handling flat files – Direct and Indirect. We need to use Indirect mode to handle a list of multiple files.

The first thing we need is a file containing the list of all the file names. I created a file called ImportFileList.txt as shown below –

10_IndirectFileList.png
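For reference, the list file is nothing more than a plain text file with one source file name per line, for example (depending on your set-up you may need full paths instead of bare file names) –

StudentsData001.txt
StudentsData002.txt
StudentsData003.txt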

Open the PowerCenter Workflow Manager. Open the workflow wf_m_FF_To_Database and then double-click on the session s_m_FF_To_Database. The 'Edit Tasks' window opens up as shown below –

11_SessionSQAttributeSetting_0.png

Go to the Mappings tab and under 'Sources' click on 'SQ_StudentsData001'. In the bottom-most window set the following properties –

  • Source filetype – Indirect
  • Source file directory – the path of the directory where ImportFileList.txt is present
  • Source filename – ImportFileList.txt

11_SessionSQAttributeSetting.png

That’s about it – those are the basic changes to be made. Save the workflow after making them. Also ensure, under the Connections tab, that the 'Students – DB Connection' connection name is valid.

The next step is to start the workflow by going to Workflows -> Start Workflow. Once it is successful you will see that all the files have been processed and that ImportFileList.txt no longer exists. The post-processing results are shown below –

12_PostWorkflowResults.png

Next Steps –

  • How do we go about automating the generation of this file list? I am guessing we would need a Command task in the workflow that reads the file names and writes out the list file – see the sketch after this list.
  • How does it change from DEV to, say, UAT – i.e. basically, how do we localize it?
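For the first point, here is a hypothetical sketch of what such a Command task could run on Windows to regenerate the list before the session starts (the folder path is just a placeholder) –

dir /B "C:\SourceFiles\StudentsData*.txt" > "C:\SourceFiles\ImportFileList.txt"

Since dir /B emits bare file names, the 'Source file directory' in the session would still need to point at the same folder.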


Informatica 101 – Starting the first mapping

At this point I am assuming you have been able to successfully install Informatica 9.6.1 legally by following the steps listed in this article – link.

First let’s define the problem statement.

Task –

Load data from a table into a flat file. The table data that I am loading is from the good old AdventureWorks. Here is the query, which I have stored in a view – Person.vw_GetTop10Persons –

select top 10 BusinessEntityID
,FirstName
,LastName
from Person.Person
order by BusinessEntityID;
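In case you want to create the same view yourself, a minimal definition would be (run against the AdventureWorks database) –

create view Person.vw_GetTop10Persons
as
select top 10 BusinessEntityID
,FirstName
,LastName
from Person.Person
order by BusinessEntityID;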

Steps –

Open the PowerCenter Repository Manager and connect to the Repository that you have created. Go to 'Folder' -> 'Create', give it a name, say 'InformaticaLearning', and click 'OK' as shown below

CreateRepositoryFolder.png

Click on Start, open up the PowerCenter Designer and connect to the Repository. The 'InformaticaLearning' folder that we created in the previous step now appears. Right-click on the folder and click 'Open' as shown below – this is how a project is opened.

Opening project from Repository

Unlike SSIS, wherein you drag in a Data Flow Task and create the source and destination within it, the process here is to create sources and destinations as separate entities. The first step is to create a source. Go to Sources -> Import from Database as shown below –

SourceImportDatabase

In the ‘Import Tables’ window, click on the ellipsis button. The ODBC Data Source Administrator (32-bit) window pops up. Click ‘Add’, and in the ‘Create New Data Source’ window scroll down to the appropriate SQL Server client and click ‘Finish’. I have both SQL Server 2014 and SQL Server 2008 R2, and the driver I am choosing is SQL Server Native Client 11.0, as seen below –

CreatingSourceFromImportTables.png

The following screens give an indication of the next steps in creating the data source –

NewDataSource_1

NewDataSource_2

Ensure that you change the default database to the actual database from which you are sourcing the data, which in this case is AdventureWorks2014.

NewDataSource_3.png

NewDataSource_4

NewDataSource_Finish

The ‘User Data Sources’ list should now contain ‘AdventureWorks’. Click ‘OK’ and you will be back at the original ‘Import Tables’ window. Here you need to select the newly created source from the drop-down and click ‘Connect’. All the tables will appear in the ‘Select tables’ window.

ImportTables_Tables.png

In the ‘Search for tables named:’ box, type the name of the view – vw_GetTop10Persons – and click ‘Search’. The narrowed result now appears. Click ‘OK’ as shown below –

ImportTables_Tables_2

The ‘Source’ is set. Now let’s create the destination. Go to Tools -> Target Designer, then ‘Targets’ -> ‘Create’ as shown below –

Targets_Create_NewTarget.png

In the ‘Create Target Table’ window, enter a name for the target, say ‘Top10Persons’, select ‘Flat File’ for ‘Select a database type:’, click ‘Create’ and then press ‘Done’.

Targets_Create_NewTarget_2.png

You now need to define the column details. Double-click on the newly created table and define the columns. You may be wondering where to put the file path, right? It’s not done here – it is done at the ‘Workflow’ level.

Targets_Create_NewTarget_3.png

Now go to ‘Tools’ -> ‘Mapping Designer’. Click on ‘Mappings’ -> ‘Create’ as shown below, give it the name m_SQL_Persons_To_FF and click ‘OK’ –

Mappings_Create

In the ‘Mapping Designer’ pane, drag and drop the source vw_GetTop10Persons onto the designer, just the same as how we drag and drop Data Flow components. You will see that the source appears along with an additional ‘SQ’ block. This is called the ‘Source Qualifier’ and is basically used as a stop for homogenizing all the source data; every source, when dragged in, gets a Source Qualifier associated with it. We will see more on that later on. Here is the window of the source along with the Source Qualifier –

Mappings_Create_2

Mappings_Create_3

Right-click on any column in the Source Qualifier and click ‘Select All’.

Mappings_Create_4

Click on the square box beside the column ‘BusinessEntityID’ in the selected list and drag it to the corresponding ‘BusinessEntityID’ column at the destination as shown below.

Mappings_Create_5

Once done, the mappings between both items will be visible.

Mappings_Create_6

Click on ‘Mappings’ -> ‘Validate’ and the following information appears, giving the results of the validation.

Mappings_Create_SaveResults

The next step is to create a ‘Workflow’ for this, as a mapping can’t run independently. Go to Mappings -> Generate Workflow as shown below –

Workflow_Generate

In the ‘Workflow Generation’ window, accept the defaults and click ‘Next >’ until the 4th step, then click ‘Finish’.

Go to Tools -> Workflow Manager. Expand ‘Workflows’ and you will find the newly created ‘wf_m_SQL_Persons_To_FF’. Right-click it and click ‘Open’ as shown below –

Workflow_Start.png

We first need to create the data source again, this time for ‘AdventureWorks’. Go to ‘Connections’ -> ‘Relational’ and click ‘New’ –

1_RelationalConnection.png

In the ‘Select Subtype’ window, click on ‘Microsoft SQL Server’ and click ‘OK’. The new Connection Object Definition opens. Note that this is entirely different from the ODBC source that you configured while creating the source in the Designer. For the attributes, the following details have to be supplied, as highlighted –

2_RelationalConnection.png

Click ‘OK’ and then, with the newly created object ‘AdventureWorks’ selected, click ‘Close’.

Double-click on the session object ‘s_m_SQL_Persons_To_FF’. The ‘Edit Tasks’ window appears. Go to the ‘Mapping’ tab and click on ‘Connections’. For the SQ type, we need to change the value to the connection we just created, so click the down-arrow button as shown below and select ‘AdventureWorks’ in the ‘Relational Connections Browser’ pop-up.

3_RelationalConnection.png

We now need to configure the flat file connection details. For this, click on ‘Files, Directories and Commands’. Give the values for ‘Output file directory’ and ‘Output filename’ as shown below

4_RelationalConnection.png

Go to ‘Workflows’ -> ‘Validate’ and ensure there are no errors. Press Ctrl+S to save. Then go to ‘Workflows’ -> ‘Start Workflow’. To check the progress of the workflow, we need to look in the PowerCenter Monitor. Click on Tools -> Workflow Monitor and navigate to the project and the workflow.

You will get a Gantt chart view of the run indicating success, as seen below –

5_RelationalConnection.png

And that, my friends, is how we create the ‘Hello World’ equivalent of a package in Informatica. Looks complicated, doesn’t it? Slowly, you will get used to it.

The next post is all about how to generate dynamic file names.

Informatica 101 – How it compares to SSIS

Recently, as part of my project, I had a chance to finally get hands-on with Informatica and how it does standard ETL. The work I was given was mostly to validate the counts by opening up the transformations and comparing them against the actual counts obtained after the workflows were run. It really piqued my interest and I wanted to get a real hang of it.

So I took up the course ‘Informatica PowerCenter 9.X Developer & Admin’ from Edureka, thanks to sponsorship by my current company. Over 2 weeks, with 3 hrs per week, I felt I had grasped sufficient information. We had a good trainer who talked us through various aspects of the tool. He would take a use case, explain what it is we are trying to achieve, and then give a walk-through of the whole process, which I could then replicate on my own.

At the end of the course we were given a small project: load a data warehouse with some dimensions of Type 1 and some of Type 2, feeding a single fact table, which I completed comfortably.

Having over a decade of experience in SSIS, my mind was constantly comparing the two, and at the moment I feel that performing certain simple tasks in Informatica is very hard compared to SSIS. So here go some of my mental notes –

Tool Overview –

  • You need, brace for it, 4 separate tools to develop, deploy and monitor the ‘workflows’ a.k.a. packages in SSIS.
    • Informatica PowerCenter Repository Manager – The place where you create the project, similar to how one creates a project in SSIS. That is where the similarity ends, though: one cannot do any development here. This is the tool mainly used to import/export objects. Additionally, it gives detailed info on various objects (more on that in future posts) such as Data Type, Precision, Key Type etc.
    • Informatica PowerCenter Designer – This is the tool where the actual development happens. Here is where you create your sources, destinations and transformations/mappings. The end product out of the Designer is a mapping, which is a culmination of source, transformations and destination. In SSIS parlance, it is like developing a package with only the Data Flow tab available.
    • Informatica PowerCenter Workflow Manager – All the mappings created in the Designer need to be brought into a workflow. Prior to that, we first create a session and select the appropriate mapping. In SSIS parlance, it is like having just the Control Flow tab available.
    • Informatica PowerCenter Monitor – Once you create the workflow, here is where you connect to run the workflows, monitor the output and restart them if needed. This plays the role of SQL Server Agent.
  • It is a metadata-driven tool, i.e. when the tool is installed you specify the database on which it is to be hosted, where it creates all the tables, stored procedures, views etc. This is administered through the Informatica PowerCenter Administrator console.

Dislikes –

  • No ‘Test Connection’ – Unlike in SSIS, where sources and destinations are intrinsic to the package, here they are separate components of the project. Starting from SSIS 2012, with the advent of project connection managers and parameters, this should sound familiar. The pain point, though, is that when you create OLE DB sources/destinations there is no ‘Test Connection’ button to validate them.
  • Re-creation of Source/Destination – As I was detailing earlier, your core development happens in the PowerCenter Designer. Once you are done, you open the PowerCenter Workflow Manager and create a session for the mapping from the Designer. Here again, when it comes to OLE DB sources/destinations, you need to create the connections all over again – they do not come along automatically.
  • Disconnect–Reconnect – Let’s say you start with a mapping and create a workflow for it, so you have two tools open – the Designer as well as the Workflow Manager. From the Workflow Manager you execute it and that’s about it. Now say you start work on a new mapping: from the Workflow Manager you need to disconnect and then reconnect to see the newly created workflow.
  • Dynamic File Naming – It’s just so hard. In SSIS, it’s all about attaching the ConnectionString to a variable and that’s about it. Here you need at least 4 variables to do the same.
  • No Loop Containers – There are absolutely no looping containers such as the For Loop and Foreach Loop. At this moment I don’t even know what the replacements are.
  • No Stored Procedure data as a source – Out of the box this just can’t be done. Again, like the above, there is no straightforward way.
  • No Data Viewers


Likes –

  • Component Re-usability – One can create a component and have it re-used across various mappings. This is what they call reusable transformations.
  • Ability to connect to any data source – Unlike SSIS, which is a beauty when it comes to a SQL Server destination and a beast when dealing with anything else, Informatica makes it a breeze no matter what the sources and destinations are.

The journey has just begun. In my next post I will detail how to create a simple package and get it executed in Informatica, in comparison with SSIS. Meanwhile, I am also looking into how I can get looping and stored procedure sources to work.