SSIS vs Informatica – Error Handling

A close friend of mine who, like me, had worked on MS-BI throughout his career recently moved to a different ETL stream: Informatica.

Being relatively new to Informatica myself (only a couple of years and a few projects here and there), we had many arguments about which tool is better and why. As I stated in my first post about Informatica, I didn't get a good feel for it when starting out: there are too many IDEs to grapple with, the transformations don't look intuitive, and simple tasks seem very hard to perform. One argument he did make is that Informatica is the market leader and is a lot more powerful owing to its code re-usability, out-of-the-box connectivity across varied sources, error handling, etc., and that SSIS is now doing a lot of catching up.

I decided to dig deeper and validate my own findings on out-of-the-box connectivity and error handling. This is one such post, with more to come.

Task –

Load a million records from a source table to a destination table. Out of the million records, about 10 are error records. The source table remains constant. There are two runs – the first with a table in an Oracle database as the destination, the second with a table in a SQL Server database as the destination.

I will use both SSIS and Informatica to accomplish this task and see how well each handles it.

Setting up the Source –

I have Oracle 11g Express set up on the machine. My source table is a simple table consisting of Id and Name columns with about a million records in it. Id is basically an auto-incrementing column. After it is populated, I run a script to update 10 random records to have a NULL Name. The SQL code below performs this –


--Source table to hold the million records
create table SrcEmployeeData
(
 Id int
 ,Name varchar(50)
);

--Insert million records in it. The value level auto-increments
insert into SrcEmployeeData
select level, 'test' from dual connect by level <= 1000000;

--Update 10 random records with Name as NULL
update SrcEmployeeData
set Name = null
where Id in
 (
 select Id
 from (
 select *
 from SrcEmployeeData
 order by DBMS_RANDOM.VALUE
 )
 where ROWNUM <=10
 );

--Commit the data
commit;

Once done, confirm there are 1 million records, with 10 of them having NULL in the Name column.
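A quick sanity check (a hypothetical verification query against the source table created above) –

--Verify the total row count and the number of NULL names
select count(*) as total_rows
 ,count(case when Name is null then 1 end) as null_name_rows
from SrcEmployeeData;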

Task 1 – Tool – Informatica, Destination – Oracle
Here is the definition of the destination table in Oracle. The NOT NULL constraint on Name is what will reject the 10 error records –

create table HR.EmployeeData(
    Id int
   ,Name varchar(50) not null
);


I created a mapping called m_Load_EmployeeData_Oracle which contains just two transformations, Source and Destination, as shown below –

1_1M_OracleDestinationLoad_Mapping

I then developed a workflow for it, wf_m_Load_EmployeeData_Oracle, set the Connections in the 'Config' section of the session as appropriate, and started the workflow. Here is the result –

2_1M_OracleDestinationLoad_Error.png

So right off the bat, without any tweaking, the workflow executed successfully and correctly captured the 10 error records, with an execution time of 1m 33s.

Task 2 – Tool – Informatica, Destination – SQL Server

I created the destination table in SQL Server with the same table definition as in Task 1. I then created a new mapping, m_Load_EmployeeData_SQL, with the same structure, i.e. source connected to destination, as shown below –

3_1M_SQLDestinationLoad_Mapping

I created the workflow wf_Load_EmployeeData_SQL, set the appropriate 'Connections' in the session properties, and ran the workflow. Here is the result –

3_1M_SQLDestinationLoad_Error

The 10 error records were captured without any ado, and in the same time. Absolutely no difference in execution time.

Task 3 – Tool – SSIS, Destination – SQL Server

The source remains the same. Here comes the real kicker, even before we get into the comparison. If we set up the Data Flow Task (DFT) with simple 'OLE DB Source' and 'OLE DB Destination' components and start the package, it is bound to fail, as shown below –

5_1M_SQLDestinationLoad_Package_Error.png

Out-of-the-box error capturing just doesn't exist. Here is what I have done to capture the error records, using a cascade of three OLE DB Destinations. This is how it works. Create three copies of your 'OLE DB Destination'. Connect your source to the first copy. Set its 'Rows per batch' and 'Maximum insert commit size' properties to 50000, as shown below –

7_1M_SQLDestinationLoad_Package_OLEDST_Setting

Under Mappings, the columns get auto-mapped by name. Connect the 'Error Output' of the first destination to the second copy and set its two values to 10000. Connect the second copy's error redirection to the third copy and, for that final one, set the two values to 1. Rename the components appropriately. The reason this works is that with fast load an entire commit batch fails if it contains even one bad row; each successive destination re-inserts the failed batch in smaller chunks, until the last one (commit size 1) isolates just the offending rows.

Now drag in one more OLE DB Destination and plug the error output of the third copy into it. This time we create a new destination table for capturing the errors, say dbo.EmployeeData_Error. The table definition is listed below –


CREATE TABLE [dbo].[EmployeeData_Error](
[ID] [numeric](38, 0) NULL,
[NAME] [nvarchar](50) NULL,
[ErrorCode] [int] NULL,
[ErrorColumn] [int] NULL,
[DateInserted] [datetime] NULL
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[EmployeeData_Error] ADD CONSTRAINT [DF_EmployeeData_Error_DateInserted] DEFAULT (getdate()) FOR [DateInserted]
GO

At the mapping level, make sure to connect the ErrorCode and ErrorColumn outputs of the third destination copy (the one with both values set to 1) to the corresponding columns of EmployeeData_Error, as shown below –

10_1M_SQLDestinationLoad_Package_OLEDST_ErrorCapture.png

With this set up, the package executes with no problem whatsoever, as shown below –

11_1M_SQLDestinationLoad_Package_Execution

Looks hale and hearty. How about the speed of execution –

11_1M_SQLDestinationLoad_Package_Execution_Speed.png

It completed in just 14 seconds – the fastest of the tasks performed so far. Now let's go to the final task.

Task 4 – Tool – SSIS, Destination – Oracle

Before we get into the package set-up, one needs to make sure the Oracle Provider for OLE DB is present. If not, get the latest stable version from the following link. The package set-up is very simple this time around. In total there are three components – an OLE DB Source for fetching the data and 2 OLE DB Destinations (one for the target table and the other for capturing the errors).

The OLE DB Destination for the target data needs its 'Data access mode' set to 'Table or view'; the 'Fast Load' option cannot be used. The image below shows the required set-up –

12_1M_OracleDestinationLoad_Package_OLEDTSSetting.png

With that set, here is the snapshot of the package run –

16_1M_OracleDestination_Package_Execution

And the execution speed –

16_1M_OracleDestination_Package_ExecutionResults

Almost an hour for just a million records. One whole hour for a dataset containing only two columns. I had a job in which I was transferring about 2 million records from a much larger table, with a similar requirement to capture error records; that package ran for almost 3 hours. With anything other than SQL Server as the destination, SSIS performs horribly.

Conclusion-

SSIS offers very poor performance for data loads whose destination is anything other than SQL Server. If you want more robustness, you need to look at third-party tools. Informatica, on the other hand, just doesn't care what the source and destination are; it works as expected, giving uniform performance.


SSIS Lookup Gotcha – Test Fully

The task on my hands was simple. I had to perform a lookup on a target table with my source data and get all the non-matched rows. The data I was fetching for the lookup from the target needed additional string manipulation to obtain the required lookup column. I wrote a query to fetch that data, tested it on a sample subset (using TOP 10) to check its correctness, and went on to implement it.

I set everything up, and when I ran the package the lookup was just not working. Rows that I expected to match were simply getting redirected to no-match. When I started debugging, it became apparent that the query I had used to retrieve the lookup dataset was incorrect. This time I ran the query without the subset, and it failed to execute, showing me the actual error.

The moral of the story is that one should not rely on the 'Preview' results offered by the 'Lookup' transform, or on a small subset when sampling the data. Check that the query works for the entire set.

Let me illustrate this with an example.

Below is a sample of Source Data –


FileName
Test1.txt
Test3.txt

Here is the data I am doing the lookup on, from a table, say dbo.LookupTable (a simple Id and FilePath column) –

1_TargetDataSample
The string manipulation query for this, as you might have already guessed, extracts the FileName from the FilePath –

select Id, RIGHT(FilePath, CHARINDEX('\', REVERSE(FilePath)) - 1) as 'FileName'
from dbo.LookupTable;

I have a simple Data Flow Task which does the following –

  1. OLE_SRC – Get Source Data – Gets the source data as shown above.
  2. LKP – Get Id – Using the query above, it fetches the Id and FileName. Note that the moment you put in that query and go to 'Columns', it throws up an error. For the purpose of this illustration, just ignore it and do the mapping.
  3. Trash Destination – A priceless open-source transformation from Konesans and a must-have development aid.

Here is how the package looks after running it –

2_DataFlowTaskContents

As can be seen, only 1 row is shown as matched even though there are 2 matching rows.

The data in this sample illustration is very small, but the original data I had was about 100k records, and it was difficult to debug why this error occurred.
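In my case, the lookup query itself blew up once it ran against the full set. A guarded sketch of the query (assuming the offending rows are ones whose FilePath contains no backslash, so CHARINDEX returns 0 and RIGHT is passed a negative length) would be –

--Guarded variant: rows with no backslash in FilePath return NULL instead of failing the whole query
select Id
 ,case when CHARINDEX('\', FilePath) > 0
   then RIGHT(FilePath, CHARINDEX('\', REVERSE(FilePath)) - 1)
   else null
  end as 'FileName'
from dbo.LookupTable;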


Import and Export Wizard – Handling Dates in Flat Files

As they say, there is a first time for everything. Having worked on so many packages throughout my career, never once had I needed to import a flat file containing dates directly through the SQL Server Import and Export Wizard. It may also be that I never had the column data type set as 'Date', instead taking it as a varchar value when doing imports.

So today, as part of a task, I had a requirement to import data with some date columns in it. Let's say my column data looks like this in a file called DateTest.txt. Note that there is a row with blank data –


StartDate,EndDate
2017-01-01,2017-01-01

2017-05-01,2017-05-01

So I opened the SQL Server Import Wizard, set the Data Source to 'Flat File Source', and browsed to the file as shown below –

1_SQLServerImport_General.png

Now go to Advanced and set the data type for both the StartDate and EndDate columns to 'DT_DBDATE', which translates to the date data type in SQL Server. For more info refer to – link. The screenshot below is for 'EndDate'; do the same for the StartDate column as well.

2_SQLServerImport_Advanced_SetDataType.png

Set the Destination to your local database, say RnD as in my case, as shown below –

3_SQLServerImport_Advanced_SetDestination.png

In 'Select Source Tables and Views', leave everything as is and click 'Next >' (this will create the table by default). Leave the defaults in the 'Save and Run Package' screen and click 'Next >'. Click Finish on the last page of the wizard. You will see 'Operation Stopped', with 'Copying to [dbo].[DateTest]' showing an error, as shown below.

4_SQLServerImport_EndOfWizard_OperationStopped.png

If you dig further into the Messages, it throws up the following error –

5_SQLServerImport_EndOfWizard_OperationStopped_ErrorMessage.png

The error states –

An OLE DB record is available.  Source: “Microsoft OLE DB Provider for SQL Server”  Hresult: 0x80004005  Description: “Conversion failed when converting date and/or time from character string.”.
 (SQL Server Import and Export Wizard)

The blank values are treated as strings, which is what the error states.

Solution 1 –

Instead of setting the data type to 'DT_DBDATE', set it to 'DT_DATE' and the import will pass. There are two side-effects to this –

  • The destination column would be of type datetime instead of date.
  • All the blank values will be set to '1899-12-30 00:00:00.000', as can be seen below –

6_SQLServerImport_Solution1_Result.png

It's not an optimal solution. If you are importing just one file with a limited number of such columns, and the destination table you are importing into isn't large, this approach can work. Depending on the use case, one can then either update the values to NULL, as sketched below, or leave them as is.
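For instance, a clean-up along these lines could be run after the import (a sketch; it assumes the wizard created dbo.DateTest as above and that the blanks ended up as the sentinel value shown in the screenshot) –

--Turn the '1899-12-30' sentinel produced by DT_DATE back into NULL
update dbo.DateTest set StartDate = null where StartDate = '1899-12-30';
update dbo.DateTest set EndDate = null where EndDate = '1899-12-30';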

Solution 2 –

This involves creating a package out of the same operations. One can go about it the traditional way, i.e. open SQL Server Data Tools (Visual Studio), add a new package, drag and drop a DFT, yada yada.

Instead, let's replicate the same behavior as before. How, you may ask? Did you know that one can fire up the 'SQL Server Import Wizard' from SQL Server Data Tools itself? Before we go further: if you have been following along, the table dbo.DateTest should already exist in your destination.
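If it doesn't, a definition along these lines should do (a sketch of what the wizard would generate with date columns; if you created it via Solution 1, the columns would be datetime instead) –

CREATE TABLE [dbo].[DateTest](
[StartDate] [date] NULL,
[EndDate] [date] NULL
) ON [PRIMARY]
GO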

Open a SSIS Project and go to ‘PROJECT’->’SSIS Import and Export Wizard…’ as shown below –

7_SQLServerImport_Solution2_VisualStudioProject.png

The wizard looks exactly the same as when fired from SQL Server; you will notice the difference at the end. Follow the same steps as earlier, i.e. setting the data type to 'DT_DBDATE'. Instead of executing, the wizard creates a package, and the final window looks like this. It creates a new package called Package1.dtsx if there isn't one already; if there is, it creates Package2.dtsx –

8_SQLServerImport_Solution2_VisualStudioProject_EndResult.png

At this point, if you run the automatically generated package as is, you will get the same error. Here are the changes to be made.

Open the Control Flow task 'Data Flow Task 1'. In the Data Flow, open the task 'Source – DateTest_txt'. Ensure that 'Retain null values from the source as null values in the data flow' is checked, as shown below –

8_SQLServerImport_Solution2_VisualStudioProject_RetainNull.png

Secondly, double-click the 'SourceConnectionFlatFile' connection manager, go to Advanced, and modify the data types of StartDate and EndDate to DT_STR with a length of 10. The image below shows EndDate; do it for StartDate as well.

20_ImportExport_AdvancedConnectionManager

Since the source connection manager has changed, the source component 'Source – DateTest_txt' needs a refresh. Double-click it and you will be presented with the changes as shown below. Accept them.

21_ImportExport_RefreshFlatFileConnectionManager

Delete the connector between the Flat File source and the OLE DB Destination and drag in a Derived Column transform between them. Add the following two expressions as shown below –

  • Derived Column Name – DC_StartDate ; Expression – StartDate == "" ? NULL(DT_DBDATE) : (DT_DBDATE)StartDate
  • Derived Column Name – DC_EndDate ; Expression – EndDate == "" ? NULL(DT_DBDATE) : (DT_DBDATE)EndDate


17_SQLServerImport_Solution2_DerivedColumnTransform.png

Connect the Derived Column output to the OLE DB Destination and set the mappings to the newly derived columns.

24_ImportExport_OLEDBDestinationMapping

In addition, set the 'Keep nulls' property of the destination to yes.

23_ImportExport_OLEDBDestination

That's it. Now execute the package. All three records now get loaded successfully –

25_ImportExport_ExecutionResult

Data gets transferred with the blank values retained as NULL, as shown below –

26_ImportExport_ExecutionResultTable

To summarize the solution: out of the box, the Import/Export Wizard will not load blank values as NULLs into date columns. Here are the changes to be made –

  • Set all the date columns for which you want blanks retained as NULL to string (DT_STR) values.
  • Add a Derived Transformation to change the data type of the data to DT_DBDATE.
  • Set the retain null property at both source and destination to yes.

Solution 3 –

The ideal solution would be one where DT_DBDATE can be used at the source itself and flows through to the destination. For some reason I have been getting strange errors while doing this, as shown below –

Error: Year, Month, and Day parameters describe an un-representable DateTime.

I am still working on it. Once I have a better solution, I will post it here.


Informatica 101 – How it compares to SSIS

Recently, as part of my project, I finally got hands-on with Informatica and how it does standard ETL. The work I was given was mostly to validate counts by opening up the transformations and comparing them against the actual counts obtained after the workflows ran. It really piqued my interest and I wanted to get a real hang of it.

So I took the course 'Informatica PowerCenter 9.X Developer & Admin' from Edureka, thanks to sponsorship by my current company. Over 2 weeks, at 3 hrs per week, I felt I had grasped sufficient information. We had a good trainer who talked us through various aspects of the tool: he would present a use case, explain what we were trying to achieve, and then give a walk-through of the whole process, which I could then replicate on my own.

At the end of the course we were given a small project: load a data warehouse with some Type 1 and some Type 2 dimensions feeding a single fact table, which I completed comfortably.

Having over a decade of experience in SSIS, my mind was constantly comparing the two, and at the moment I feel certain simple tasks are very hard to perform in Informatica compared to SSIS. So here go some of my mental notes –

Tool Overview –

  • You need, brace for it, 4 separate tools to develop, deploy and monitor the 'workflows' (a.k.a. packages in SSIS).
    • Informatica PowerCenter Repository Manager – the place where you create the project, similar to how one creates a project in SSIS. That's where it ends, though; one cannot do any development here. This is the tool mainly used to import/export objects. It also gives detailed info on various objects (I will talk about this in future posts) such as Data Type, Precision, Key Type, etc.
    • Informatica PowerCenter Designer – this is the tool where the actual development happens. Here is where you create your sources, targets and transformations. The end product of the Designer is a mapping, which is the culmination of Source, Transformations and Target. In SSIS parlance, it is like developing a package with only the Data Flow tab available.
    • Informatica PowerCenter Workflow Manager – all the mappings created in the Designer need to be brought into a workflow. To do that, we first create a session and select the appropriate mapping. In SSIS parlance, one can say it is like having just the Control Flow tab available.
    • Informatica PowerCenter Workflow Monitor – once you create the workflow, this is where you connect, run the workflows, monitor the output, and restart if needed. It plays the role of SQL Server Agent.
  • It is a metadata-driven tool, i.e. when the tool is installed you need to specify the database on which it is to be hosted, where it creates all its tables, stored procedures, views, etc. This repository is managed through the Informatica PowerCenter Administrator console.

Dislikes –

  • No 'Test Connection' – Unlike in SSIS, where sources and destinations are intrinsic to the package, here they are separate components of the project. Starting with SSIS 2012 and the advent of project-level connection managers, this should sound familiar. The pain point, though, is that when you create OLE DB sources/destinations there is no 'Test Connection' button to validate them.
  • Re-creation of Source/Destination – As detailed earlier, your core development happens in PowerCenter Designer. Once you are done, you open the Workflow Manager and create a session for the mapping from the Designer. Here again, when it comes to OLE DB sources/destinations, you need to create the connections all over again! They don't carry over.
  • Disconnect-Reconnect – Let's say you start with a mapping and create a workflow for it, so you have two tools open – the Designer as well as the Workflow Manager. From the Workflow Manager you execute it, and that's about it. Now say you start work on a new mapping; from the Workflow Manager you need to disconnect and then reconnect to see the newly created workflow.
  • Dynamic File Naming – It's just so hard. In SSIS, it's all about attaching the ConnectionString to a variable, and that's about it. Here you need at least 4 variables to do the same.
  • No Loop Containers – There are absolutely no looping containers such as the For Loop or Foreach Loop available. At this moment I don't even know what the replacements are.
  • No Stored Procedure data as a source – Out of the box this just can't be done. Again, like the above, there is no straightforward way.
  • No Data Viewers


Likes –

  • Component Re-usability – One can create a component and re-use it across various mappings. This is what they call reusable transformations.
  • Ability to connect to any data source – Unlike SSIS, which is a beauty when it comes to a SQL Server destination and a beast when dealing with others, Informatica makes it a breeze no matter what the sources and destinations are.

The journey has just begun. In my next post I will detail how to build one simple package and get it executed in Informatica, in comparison with SSIS. Meanwhile, I am also looking to see how I can get looping / stored procedure sources to work.

SSIS – Reading datetime stamped flat files

One of the most often asked interview questions I have encountered goes like this –

I have a list of flat files in a folder with a datetime stamp in their names, and I would like to read only the current month's file. How would you do it? A similar variation of the question was posed on StackOverflow and I thought, why not just write a post on it. This particular question differs slightly from that problem statement, i.e. in this instance the ask is to iterate through all the files from the oldest time-stamped file to the newest, which is the logic I have written for.

At a high level, the idea is to –

  1. Get all the files present in the directory.
  2. Loop through each file and store its metadata in a dictionary. A dictionary is basically a collections object that stores data as key/value pairs. In our case the key is the datetime part of the file name and the value is the full file path.
  3. Load all the Keys from the dictionary to a List.
  4. Order the List values.
  5. Loop through each key in the list, get the corresponding value from the dictionary, and store it in another list, say orderedFiles.
  6. Assign the orderedFiles list to an SSIS object variable.
  7. Use this SSIS object as the reference to loop over in the Foreach Loop container, using the 'Foreach From Variable Enumerator'.

Steps (1)–(6) are achieved via a Script Task. Let's see how this works.

First, I created a bunch of files with a timestamp suffix of the form YYYY_MM_DD.txt, as shown below –

0_Sample Files.png

Create a new package and drag in a Script Task. Create three variables as shown below –

7_Variables Declaration

  • objOrderedFiles – stores the list of file names in sorted order, i.e. oldest to newest.
  • strFilesPath – path of the folder where the files are stored.
  • strFullFilePath – full path of the file currently being processed.

In my Script Task editor, I passed 'strFilesPath' as a ReadOnlyVariable and 'objOrderedFiles' as a ReadWriteVariable, as shown below –

4_PassingVariablesToScriptTask

Now click the 'Edit Script' button. In the window that opens, we need to ensure the target framework is 3.5 for our code to work. Click the 'Project' menu and select the last menu option that ends with 'Properties…', as shown below –

1_Changing Reference

In the Application tab, under Target framework, select '.NET Framework 3.5' as shown below –

2_Changing Reference

You will get the following dialog box; click 'Yes' –

3_Changing Reference

Now, in the 'ScriptMain.cs' file, paste the following code inside Main() –


// Note: the following using directives are needed at the top of ScriptMain.cs
// for this code to compile: System.IO, System.Collections.Generic and System.Linq.

string strFilesPath = Dts.Variables["User::strFilesPath"].Value.ToString();

Dictionary<DateTime, string> dictionary = new Dictionary<DateTime, string>();
List<string> orderedFiles = new List<string>();
string[] files = { };

// get all the matching files from the directory (and its sub-directories)
if (Directory.Exists(strFilesPath))
    files = Directory.GetFiles(strFilesPath, "bookreview_daily_*.txt", SearchOption.AllDirectories);

string strFilePath;
string strFileName;
DateTime dt;

// store each file's datetime and full path as a key/value pair in the dictionary
foreach (string file in files)
{
    strFilePath = file;
    strFileName = Path.GetFileNameWithoutExtension(file);

    // extract the date from the file name suffix,
    // e.g. bookreview_daily_2016_01_02 gives the date 2016-01-02
    dt = DateTime.Parse(strFileName.Replace("bookreview_daily_", "").Replace("_", "-"));

    dictionary.Add(dt, strFilePath);
}

// get the keys (dates) into a list and sort it, oldest first
var list = dictionary.Keys.ToList();
list.Sort();

// walk the sorted dates and build the ordered list of file paths
foreach (var val in list)
{
    orderedFiles.Add(dictionary[val]);
}

// assign the ordered list to the user-defined SSIS object variable
Dts.Variables["User::objOrderedFiles"].Value = orderedFiles;

Dts.TaskResult = (int)ScriptResults.Success;

Drag in a Foreach Loop container and connect it to the above Script Task. Double-click the container to edit its options. Go to Collection, set the enumerator to 'Foreach From Variable Enumerator', and under 'Variable' choose User::objOrderedFiles, as seen below –

5_ForEachLoop container Choosing Enumerator

In the Variable Mappings window, select the variable User::strFullFilePath as seen below –

6_ForEachLoop container assigning variable.png

That's it. Now all the files are looped through from oldest to newest. I put another Script Task inside the loop that simply pops up the name of the file being read.

Here is the outline of the package –

8_PackageOutline

Once you run it, you see that the first file popped up is the oldest one –

9_PackageResult

Now, if the question is about getting only the current month's files, we just need to modify the logic while looping through the list to check whether the date falls in the current month; pretty much everything else stays the same, as sketched below.
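As a rough sketch (re-using the list, dictionary and orderedFiles variables from the script above), the final loop in the Script Task could filter on the current month like this –

// only keep files whose date falls in the current month and year
foreach (var val in list)
{
    if (val.Year == DateTime.Now.Year && val.Month == DateTime.Now.Month)
    {
        orderedFiles.Add(dictionary[val]);
    }
}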