Total Pageviews

Wednesday, 29 February 2012

Testing in ETL, Data-Centric Projects


Testing in ETL, Data-Centric Projects
Testing is an investigation process that is conducted to check the quality of the product. Product can either be an applications or data. The quality of the data can only be determined by checking data against some existing standards by following a set of processes. By doing so, you find out the symptoms in form of invalid/incorrect data that happened because of erroneous processes. So the data-centric testing results in achieving high quality data by getting the erroneous processes fixed.
This article highlights the types of data-centric testing approaches and discovers what each testing approach uncovers.
Types of Testing
  • Application Testing: The focus of this article is data-centric testing so we’ll not discuss application testing here.
  • Data-Centric Testing: Data-centric testing revolves around testing quality of the data. The objective of the data-centric testing is to ensure valid and correct data is in the system. Following are the couple of reasons that cause the requirement of performing data-centric testing:
    • ETL Processes/Data Movement: When you apply ETL processes on source database, and transform and load data in the target database
    • System Migration/Upgrade: When you migrate your database from one database to another or you upgrade an existing system where the database is currently running.
Data-Centric Testing
The data-centric testing validates data using the following approaches:
  • Technical Testing: Technical testing ensures that the data is moved, copied, or loaded from the source system to target system correctly and completely. Technical testing is performed by comparing the target data against the source data. Following is a list of functions that can be performed under technical testing:
    • Checksum Comparison: The data-centric testing makes use of checksum approach to discover errors. Checksum can be performed on source and target databases in n number of ways such as counting the number of rows, and adding the data of a column. Later the result of checksum calculated on source database is compared against the checksum calculated on the target database.
      For example, row count compares the number of rows in the target database with the number of corresponding rows in the source database. Or the target database may contain the summarized annual data for monthly salaries in the source database. So the target database should contain sum of the monthly salaries paid within a year for each year.
    • Domain comparison: The domain list in the target database is compared against the corresponding domain list in the source database. For example, the source system has 100 employees and the target system also has 100 employees but it does not guarantee that the employees in both the source and target domain lists are same unless compared. In Domain comparison, employee names in the target domain list are compared against the employee names in the source domain list.
    • Multi-value comparison: Similar to List comparison, multi-value comparison compares the whole record or the critical columns in a record against the corresponding values in the source system.
      For example, the domain comparison reports that all the employees are same and checksum comparison reports that the salaries on source and target match, but still it does not guarantee of a valid record because the data movement may result in assigning wrong employee name to the salary entry. Such issues are not discovered by Checksum comparison and Domain comparison. The multi-value comparison discovers such issues by comparing the key attributes of each entity in source and target databases.
  • Business Testing: Business testing is done to ensure that the data fulfills the requirements of the business. Data may have been moved, copied, or loaded completely and accurately, and technical testing does not report any issue still there are chances that the system still contains the invalid data. To ensure high quality data, the data is evaluated against the business rules.
For example, value of certain fields such as salary and commission cannot be less than zero. These types of errors can occur because of data manipulation between the source and target systems. Unless you test data for any such errors, the quality of the data is not guaranteed.
For every business, the list of the business requirements is different. An exhaustive list of business rules against which the data is compared, ensures high quality data.
  • Reconciliation: Reconciliation ensures that the data in the target system is in agreement with the overall system requirements. Following are the couple of examples of how the reconciliation helps in achieving high quality data:
    • Internal reconciliation: In this type of reconciliation, the data is compared within the system against the corresponding data set. For example shipping would also always be less than or equal to the orders. If the shipping ever exceeds the orders then it means data is invalid.
    • External reconciliation: In this type of reconciliation, data in system is compared against its counterpart in other systems. For example, in a module or in an application, number of employees can never be more than the number of employees in HR Employee database. Because HR Employee database is the master database that keeps record of all the employees. If such a situation occurs where the number of employees anywhere in the system is more than the HR Employee database, then the data is invalid.
My company developed the ICE± product, which targets to accomplish all types of testing mentioned in this article. For more information, visit www.iCEdq.com

Oracle recycle bin flashback table feature
When you do ‘Select * from tab;’ some times, you will be surprised to see some tables with garbage names. Welcome to the world of Oracle Recycle Bin Feature. Because of this feature, Oracle saves the dropped table in recycle bins until you clear it.
1. Empty recycle bin use the command:
PURGE RECYCLEBIN;
2. To drop the table without storing the table in recycle bin use:
DROP TABLE employee PURGE;
3. To restore the table from the recycle bin use:
FLASHBACK TABLE employee TO BEFORE DROP;
So don’t forget to clean your database /Schema once in a while.


Pitfalls of type II dimension

Type II dimension has been popularized by R Kimball. it has become so popular that in any interview related to data warehouse, the interviewer will surely ask to explain the concepts. And chances are that if you don’t know; they will laugh at your ignorance and reject you.
Here’s your chance to laugh at them..
If you read this article, probably you will end up knowing something more than them. This is not because you will find the definition of type II dimension, but for an entirely different reason.

To be continued…
To clearly explain the pitfalls of Type II dimension, let’s take an example. In the example, there are three tables, that is, DIM_Instrument, FACT_Trade, and FACT_Settlement. Each of the tables contains data as shown below:
DIM_Instrument table
DIM_Instrument

In the Dim_Instrument table, the property of instrument IBM changes from X to Y. So to maintain type II dimensions, a new entry is added in the table by updating the status of current entry to obsolete (denoted by ‘O’) and adding date in the TODT column as well. In the new entry, the ToDT column is NULL and the status is current (denoted by ‘C’).
FACT_Trade table
FACT_Trade

The trade table contains information about just one trade and that was executed on April 29th, 2011, which means it was processed for InstrumentKey ‘1’ as Instrumentkey ‘2’ did not exist on April 29th, 2011.
FACT_Settlement table
FACT_Settlement

Generally, it takes three days for a trade to settle. So the trade that was executed on 29th April 2011, got settled only on 2-MAY-2011. During this period the property of instrument ‘IBM’ changed on May 1st 2011, a new entry was made in the DIM_Instrument table for Instrument IBM, which incremented the instrumentKey to 2.
Now in the settlement table, the instrument key against the same trade is different, which can cause the issues/concerns that are described using various scenarios.
Scenario 1:
Get data for each trade and settlement for the current instrument, using the following query:

SELECT T.INSTRUMENTKEY, T.TradeDate as Sec_Date
From
DIM_INSTRUMENT I, FACT_TRADE T
WHERE
T.InstrumentKey=I.InstrumentKey
and I.Status=’C’
UNION ALL
SELECT S.INSTRUMENTKEY, S.SettlementDt as Sec_Date
From
DIM_INSTRUMENT I, FACT_SETTLEMENT S
WHERE
S.InstrumentKey=I.InstrumentKey
and I.Status=’C';

Scenario1
The output does not show any data from Fact_Trade table. Because the current record in Dim_instrument is with InstrumentKey as ’2 and the Fact_Trade table does not contain InstrumentKey ’2.
It happens because the instrumentkey is changed between trade execution and trade settlement. Though, the settlement is done against the trade processed on April 29th, 2011 but with the change in InstrumentKey in between, there is no way to retrieve it.
Scenario 2:
Rather than querying data for current instrument, get data for all the instrument records. You can do so by using the following query:
SELECT T.INSTRUMENTKEY, T.TradeDate as Sec_Date
From
DIM_INSTRUMENT I, FACT_TRADE T
WHERE
T.InstrumentKey=I.InstrumentKey
UNION ALL
SELECT S.INSTRUMENTKEY, S.SettlementDt as Sec_Date
From
DIM_INSTRUMENT I, FACT_SETTLEMENT S
WHERE
S.InstrumentKey=I.InstrumentKey

The above query returns the following output:
Scenario2OP1

If you analyze the output closely, it returns the trade information for InstrumentKey ’1 but no settlement information for InstrumentKey ’1. Similarly for InstrumentKey ’2 there is no trade information but the settlement information. This output can be best viewed as the following:
Scenario2OP2

For an instrument when trade exists, settlement does not exist and vice versa.
Scenario 3:
Maintain a relationship between two fact tables by maintaining a common key (i.e. TradeID) and join both the tables based on this common key. In this case, the TradeID uniquely identifies the trade and settlement data that refer to the same trade.
But this scenario results in fact-to-fact relationship, which is not recommended because of the following reasons:

  • Fact tables being the largest tables adversely impact the performance of the join.
  • It is not a practical scenario when multiple fact tables need to be joined because doing so complicates the design and is not always possible to achieve.
All the above scenarios highlight various pitfalls of Type II dimension. All the scenarios and examples used throughout this article are purely for demonstrating the highlighted pitfalls.
Posted in Data Architecture, Data Modeling, Realtime ETL

7 Responses to “Pitfalls of type II dimension”

·         http://0.gravatar.com/avatar/c679d232673464017e0d6fa39b1f0e76?s=32&d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D32&r=GSourav Guha says:
The problem mentioned is very real life. Even I faced this issue and I have a woork around for that. It is should be taken care in the data model. Add one PARENT_INSTRUMENT_KEY attribute in the DIM_INSTRUMENT table to track that both 1 and 2 are under the PARENT_INSTRUMENT_KEY. For eg. 1, A1; 2, A1… N, AN.
·         http://1.gravatar.com/avatar/d5e895f4c998481f9da5e635df336ffd?s=32&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D32&r=GSai says:
Friend can you explain what is Type II dimension? When we talk about type II dimension it is a concept woven around the with the help of tools. As Sourav mentioned it can be corrected at data model.
Problem u mentioned is clearly data model issue not concepts. What do u say?
·         http://1.gravatar.com/avatar/9aa16ccbd8897075a3104b54d030035d?s=32&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D32&r=GAdministrator says:
Sai, I’m glad you mentioned this. If we look carefully, this article tried to bring into notice the pitfalls of Type II dimensions. As mentioned earlier, type II requires the primary key change for the same identity to maintain the history. Once the primary key changes, we can very well imagine what kind of results it can produce.
As far as the solution is concerned, it can be implemented the way you want. It should not be assumed that there is no solution to the issues reported here. In fact, the solution for this has to be implemented at the Data Model level.
Sometimes the problem and the solution are so closely woven that we prefer not to look at them separately.
Hope it clarifies…

Difference between Reference Data and Master Data

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences.
Lets try to understand the differences with an example of sales transaction.

A sales transaction contains information like….
Store,
Products Sold,
Sales Person,
Store Name,
Sales Date,
Customer,
Price,
Quantity,
etc.

Attributes from the above example can be separated into two types: Factual (transactional) and Dimensional information
Price and Quantity are measurable attributes of a transaction.
Store, Products Sold, Sales Person, Store Name, Sales Date, and Customer are dimensional attributes of a transaction.

We can see that the dimensional data is already embedded in the transaction. And with dimensional attributes we can successfully complete the transaction.Dimensional data that directly participates in a transaction is master data.
But is the list of dimensional attributes in the transaction complete?

Asking few analytical questions can help us discover the answer.
     -What is the Male to Female ratio of customers doing purchase at the store?
     -What type of products are customers buying? Ex: Electronic, Computers, Toys
     -What type of Store is it?  Ex: Web store, Brick & Mortar, Telesales, Catalog Sales

The above questions cannot be answered by attributes in the transaction. These dimensional data is missing in the transactions.  This missing dimensional data that does not directly participate in transaction but are attributes of the dimension is reference data.

Why it is important for an ETL person to understand the differences? Well once the  ‘Reference Data Management’ (RDM) was popular then suddenly in last few years there is this new word ‘Master Data Management’ (MDM). These words mean different things and they have significant implication on how they are managed. But that will be a topic of discussion for some future post!  I hope this article will help clear atleast some confusion.


Loading & testing fact/transactional/balances (data), which is valid between dates!
This is going to be a very interesting topic for ETL & Data modelers who design processes/tables to load fact or transactional data which keeps on changing between dates. ex: prices of shares, Company ratings, etc.
variable_bond_interest.JPG
The table above shows an entity in the source system that contains time variant values but they don’t change daily. The values are valid over a period of time; then they change. variable_bond_interest_fct1.JPG
1 .What the table structure should be used in the data warehouse?
Maybe Ralph Kimball or Bill Inmon can come with better data model! :-)But for ETL developers or ETL leads the decision is already made so lets look for a solution.
2. What should be the ETL design to load such a structure?
Design A
  • There is one to one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute, that means every time the ETL process get a new value it has add a new row with current flag and go to the previous row and retire it. Now this step is a very costly ETL step it will slow down the ETL process.
  • From the report writer issue this model is a major challange to use. Because what if the report wants a rate which is not current. Imagine the complex query.
Design B
  • In this design the sanpshot of the source table is taken every day.
  • The ETL is very easy. But can you imagine the size of fact table when the source which has more than 1 million rows in the source table. (1 million x 365 days = ? rows per year). And what if the change in values are in hours or minutes?
  • But you have a very happy user who can write SQL reports very easily.
Design C
  • Can there be a comprimise. How about using from date (time) – to date (time)! The report write can simply provide a date (time) and the straight SQL can return a value/row that was valid at that moment.
  • However the ETL is indeed complex as the A model. Because while the current row will be from current date to- infinity. The previous row has to be retired to from date to todays date -1.
  • This kind of ETL coding also creates lots of testing issues as you want to make sure that for nay given date and time only one instance of the row exists (for the primary key).
Which design is better, I have used all depending on the situtation.
3. What should be the unit test plan?
There are various cases where the ETL can miss and when planning for test cases and your plan should be to precisely test those. Here are some examples of test plans
a. There should be only one value for a given date/date time
b. During the initial load when the data is available for multiple days the process should go sequential and create snapshots/ranges correctly.
c. At any given time there should be only one current row.
d. etc
NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc

ETL delta logic & de-normalization of data model.

It is a normal practice in data warehouse to de normalizes (Or once auto corrected as demoralize) as the data model for performance. I am not going to discuss the benefits vs. issues with de-normalization.  As by the time it comes to the ETL guy the fate of the model is already decided.
Let’s look at the model in the source side, which is perfectly normalized. 
normalized_emp.JPG      


Now let’s look at the de normalized model on the target side. 
denormalized_emp.JPG      


Next lets think of delta logic for loading of the dim_employee table. Ideally you would only check changes in the employee table. Then if there is any changes after the last load date time ; then get those rows from ref_employee and do the lookup to get the department & the designation and load it into the target table.

The issue with this delta logic is that it has not considered the effect of de normalization of employee table on the target side. If you carefully look at the two de normalized attributes dept_name and emp_designation_desc, the ETL process will miss any changes in the parent tables, so only new employees or updated employee will get the new definition of department & designation. And any employee that has not been updated in the source side will still have the same dept_name & emp_designation_desc. This is wrong.

The reason it is wrong is the ETL delta logic only picked the row from the employee table when it changed and ignored the changes in the dept & designation tables. The truth of the matter is, ” For any de normalized target table data (affected rows) should be re-captured from the source, any time there is change in the driving/core table as well as when there is change in any parent tables to which the driving table refers to.” In this case, even if there is change in department or designation table, all the rows affected on the employee tables should be re-processed.
It might seem very simple, but ETL developers/designers/modelers always miss this point. Also once developed it is very difficult to catch.
The next question is how you would catch the affected rows. Well there are ways to write SQL that combine the three tables (in this case) and treat them as one single entity and the pull rows based on the any update_dttm greater than the last ETL run. Figure out the SQL… 

Types data elements and entities (Tables) for ETL.

It is important for an ETL developer to understand the types of tables and data, to intelligently design ETL processes. Once the common types objects are understood, reusable templates for ETL can be developed, regardless of business logic. This will greatly improve the efficiency of an ETL developer.
1. Reference data

2. Dimensional data (master data)

3. Transactional data

4. Transactions

5. Balances

6. Summary/Aggregations

7. Snapshots

8. Staging

9. Out triggers/mini dimensions

10. Log tables

11. Meta data tables

12. Security tables

13. Configuration tables

 

Simulating Oracle Sequences in Sybase & SQL Server

Programmatic control is lost when identity columns are used in Sybase and SQL Server. I do not recommend using Identity columns to create surrogate keys during ETL process. There are many more reasons for that. Oracle has the sequence feature which is used extensively by Oracle programmers. I have no clue why other vendors are not providing the same. This custom code has been used extensively by me and thoroughly tested. I ran multiple processes simultaneously to check if there is deadlock and also made sure that the process returns different sequences to different client process.
Notes: -

1. The table should have ‘ROW LEVEL LOCKING’

2. The sequence generator process is stateless (See more details in Object Oriented Programming)

3. Create one row for each target table in the sequence master table. Do not try to use one sequence for multiple tables. It will work but probably is not a good idea.

Step 1: -Create a table with following structure.

CREATE TABLE sequence_master (
sequence_nm varchar (55) NOT NULL ,
sequence_num integer NOT NULL
)
GO

Step 2: -Create a stored procedure that will return the next sequence.

CREATE PROCEDURE p_get_next_sequence
@sequence_name varchar(100)
AS
BEGIN
DECLARE @sequence_num INTEGER
— Returns an error if sequence row is entered into the table.
SET @sequence_num = -1

UPDATE sequence_master
SET @sequence_num = sequence_num = sequence_num + 1
WHERE Sequence_name = @sequence_name

RETURN @sequence_num
END
GO

Multiple executions ETL process against same set of data.

Every ETL designer, developer & tester should always ask this question…”What will happen, if I run the ETL process multiple times, against the same data set?”
Answer: 1. I get the same result set.
Answer: 2. I get multiple result set.

If you go back to the original article on What is ETL & What ETL is not! You will immediately come to the conclusion that Answer 2 is incorrect, as ETL is not allowed to create data.
Why will the process run more than once against the same set of data? Many reasons, example most common being operators mistake, accidental kickoff, old set of data file remaining in the directory, staging table loaded more than once, intentional rerun of ETL process after correction of some data in source data set, etc. Without going into further details, I would advise ETL folks to always include in your process ways to prevent it from happening by one or more combinations of following methods…
1. Identify the primary key (logical/physical) and put update else insert logic.
2. Deleting the target data set before processing again (based on logical/physical primary key)
3. Preventing occurrences of multiple runs by flagging processed dates
4. Marking processed records with processed flags after commit
5. Prevention of multiple loads in the staging area
6. identifying duplicate records in stage area before the data gets processed
7. more…

So do these experiments in the development or test environment run the ETL process more than once, check the result! If you get the result 2 (copies of rows, with no way to distinguish or retiring the old rows)
The designer or the developer is wrong & if the process as passed the QA or testing then the tester is wrong.

Bottom line:
A test case to check multiple runs is must in life cycle of an ETL process.


Data staging table / area design.

This could be long topic of discussion.  Following are the main issues I would like to discuss on staging table /database design.
1. Why staging area is needed?
Unlike OLTP systems that create their own data through an user interface data warehouses source their data from other systems. There is physical data movement from source database to data warehouse database. Staging area is primarily designed to serve as intermediate resting place for data before it is processed and integrated into the target data warehouse. This staging are serves many purpose above and beyond the primary function
a. The data is most consistent with the source. It is devoid of any transformation or has only minor format changes.
b. The staging area in a relation database can be read/ scanned/ queried using SQL without the need of logging into the source system or reading files (text/xml/binary).
c. It is a prime location for validating data quality from source or auditing and tracking down data issues.
d. Staging area acts as a repository for historical data if not truncated
e. Etc.

2. What is the difference between staging area as compared to other areas of data warehouse?
a. Normally tables in any relational database are relational. Normally tables are not stand alone. Tables have relationship with at least one or more tables. But the staging area greatly differs in this aspect. The tables are random in nature. They are more batch oriented. They are staged in the hope that in the next phase of load there will be a process that will identify the relationship with other tables and during such load a relationship will be established.
 

3. What should the staging table look like?
Staging Table Format
a. The key shown is a meaningless surrogate key but still it has been added the reason being; as may times the data coming from a source has no unique identifier or some times the unique identifier is a composite key; in such cases when data issue is found with any of the row it is very difficult to identify the particular row or even mention it. When a unique row num is assigned to each row in the staging table it becomes really easy to reference it.
b. Various dates have added to the table; please refer date discussion here.
c. The data type has been kept as string because this data type ensures that a bad format or wrong data type row will be at least populated in the stage table for further analysis or follow-up.
d. Source system column has bee added to keep a data reference so that next process step can use this value and can have dynamic behavior during process based on the source system. Also it supports reuse of table, data partitioning etc.
e. Note the table has source as table qualifier as prefix this distinguishes the table from other source system. Example customer from another system called MKT.
d. Other columns can be added example processed flag to indicate if the row has been processed by the down stream application. It also provides incremental restart abilities for the down stream process.  Also exception flag can be added to the table to indicate that while processing the table an exception or error was raised hence the row is not processed.
4. Which design to choose?
a. Should the table be truncated and loaded?
b. Should the table will be append only?
c. Should default data type to be left as alpha numeric string (VARCHAR)?
d. Should constraints be enforced?
e. Should there be a primary Key?

It is normally based on the situation but if not sure or you don’t want to think then design suggested here should more than suffice your requirement

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sometimes an ETL process runs considerably slow speed. During test for the small result set it might fly but when a million rows are applied the performance takes a nosedive. There can be many reasons for slow ETL process.  The process can be slow because read, transformation, load. Lets eliminate the transformation and load for the sake of discussion.
For ETL process to be slow on read side here are some reasons .1. No indexes on joins and/or ‘where clause’ 2. Query badly written. 3. Source not analyzed.  Out of these three lets rule out 1 & 2.
In the past most of the databases had RULE based optimizer in the INIT.ORA file, but with new development and specially Data warehouses ‘CHOOSE’ optimizer is preferred. With ‘CHOOSE’ option the query uses COST based optimizer if the statistics are available for the tables in the query.

There are two methods to gather statistics 1. DBMS_STATS package, 2. ANALYZE command. Oracle does not recommend ANALYZE command going forward for various reasons (its a command, cannot analyze external tables, it gathers stats that are not essential, inaccurate stats on partitioned tables and indexes, in future ANALYZE will not support cost-based optimizer, no Monitoring for stale statistics, etc.)
DBMS_STATS.GATHER_SCHEMA_STATS (ownname =>’DWH’, options =>’GATHER AUTO’);
This Package will gather all necessary statistics automatically (there is no need to write a process to check for tables that have stale or no stats). Oracle implicitly determines which objects need new statistics, and determines how to gather those statistics. So once it is put in the ETL flow you can sleep at home and everybody will be happy.

Bottom Line:
After major data load the tables should be analyzed with an automated process. Also please do not use ANALYZE , as in future it will not collect statistics needed for COB. The code can be found here.

1This package should be used as the part of ETL workflow to make sure as batch processes are run the benefit of statistics are available to the next process in the que.
Also ETL programmers can call this package within the ETL process/procedure in the begening so that if the tables required for process are not analyzed they will be analyzed automatically. Also if the procedure is going to modify lot of data in the table then the package can be called just before the end of the procedure.

Data integration basics-Dimension Confirmation.

The purpose of this topic is to establish the basics for design of ETL processes. With out the understanding an ETL process for data integration cannot be designed or developed. 
A database can have multiple sources. Multiple sources may contain a data set of entirely different subject areas, but some data set will intersect. Example Sales data and salary data will have employee as the common set. 

Between two or more sources, the subjects, entities or even attributes can be common. So can we integrate the data easily? Mathematically it seems very easy, but the real world is not just about numbers or exact same string values. Every thing can be similar or same but may be not represented in the exact same manner in all the sources. The differences in representation of same information and facts between two or more sources can create some of the most interesting challenges in Data Integration.
Data integration: -The first step in data integration is identification of common elements
1. Identify the common entities.
Example, Employee dimension can come from Sales system, Payroll System, etc. Products can come from manufacturing, sales, purchase etc. Once the common entity is identified its definition should be standardized. Example, does employee include fulltime employees as well as temporary workers?

2. Identify the common attributes.
What are the attributes that are common to employee, 1st name, 2nd name, last name, date of joining, etc? Each attribute should be defined.

3. Identify the common values
Same information can be represented in different forms in multiple source system. Example, male sex, can be represented as ‘M’ or ‘1
or ‘male’ or some thing else by each source system. A common representation must be decided (example, ‘Male’). Also if necessary a finite set of values should be established. Example the employee sex = (‘Male’,’ Female’) and it will not allow more than these two values.

The second step is the identification of Data Steward who will own the responsibility and ownership for particular set data elements.
The third step is to design an ETL process to integrate the data into the target.  This is the most important area in the implementation of an ETL process for data integration. This topic will be discussed in more detailed under its own heading.
The final fourth step is to establish a process of maintenance, review & reporting of such elements.

For every rule there is an exception; for each exception there are more exceptions…

To implement an ETL process there are many steps that are followed. One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target and the rules of data transformation.
Ex. Table / column map between source and target, rules to identify unique rows, not null attributes, unique values, and range of a attributes, transformations rules, etc.

Without going into further details of the document, lets analyze the very next step. It seems obvious and natural to start development of the of the ETL process. The ETL developer is all fired up and comes up with a design document and starts developing, few days time the code is ready for data loading.
But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed. And then it fails again. What’s happening? Analysis was done properly; rules were chalked out & implemented according to the mapping document. But why are issues popping up?  Was something missed?
Maybe not! Isn’t it, normal to have more issues in the initial lifetime of the processes?

Maybe Yes! You have surely missed ‘Source System Data Profiling’. The business analyst has told you rules as the how the data is structured in the source system and how it is supposed to behave; but he/she has not told you the ‘buts and ifs’ called as EXCEPTIONS for those rules.

To be realistic it is not possible for anyone to just read you all rules and exceptions like a parrot. You have to collaborate and dig the truth. The actual choice is yours, to do data profiling on the source system and try to break all the rules told by the analyst. Or you can choose to wait for the process to go live and then wakeup every night as the load fails.  If you are lucky enough you deal with an unhappy user every morning you go to the office.

Make the right choice; don’t miss ‘Source system data profiling’ before actually righting a single line of code. Question every rule. Try to find exception to the rules. There must be at least 20 tables. One table on an average will have 30 columns; each column will have on an average 100k values. If you make matrix of number of tables * columns * data values, it will give the number of reasons the why your assumptions may be wrong.   It’s like unit testing source data even without loading. There is a reason why machines alone cannot do your job; there is reason why IT jobs are more paying.

Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’

ETL Startegy to store data validation rules

Every time there is movement of data the results have to be tested against the expected results. For every ETL process, test conditions for testing data are defined before/during design and development phase itself.  Some that are missed can be added later on.
Various test conditions are used to validate data when the ETL process is migrated from DEV-to->QA-to->PRD. These test conditions are can exists in the developer’s/tester’s mind /documented in word or excel. With time the test conditions either lost ignored or scattered all around to be really useful.
In production if the ETL process runs successfully without error is a good thing. But it does not really mean anything. You still need rules to validate data processed by ETL. At this point you need data validation rules again!
A better ETL strategy is to store the ETL business rules in a RULES table by target table, source system. These rules can be in SQL text. This will create a repository of all the rules in a single location which can be called by any ETL process/ auditor at any phase of the project life cycle.
There is also no need to re-write /rethink rules. Any or all of these rules can be made optional, tolerances can be defined, called immediately after the process is run or data can be audited at leisure.
This Data validation /auditing system will basically contain
A table that contains the rules,
A process to call is dynamically and
A table to store the results from the execution of the rules

Benefits:
Rules can be added dynamically with no cange to code.
Rules are stored permanantly.
Tolerance level can be changed with ever changing the code
Biz rules can be added or validated by business experts without worring about the ETL code.

Introduction to Error and exception management.

ETL is all about transportation, transformation and organizing of data. Of anytime something moves (as a matter of fact even if you are perfectly stationary and items around moves) accidents are bound to happen. So any ETL specialist believes that their code is perfect and nothing can happen obviously lives in a fool’s paradise.
The next obvious thing is to design to manage accidents, like making a safer car or a factory. And as an ETL specialist if you don’t do it you are no different then others. As in any country there are laws for accidents and accident due to criminal negligence. Later being the worst.
How many times I have seen people putting ETL code into production without actually designing processes to prevent, manage or report accidents. Writing code is one thing writing production worthy code is another. Do ask yourself or your developers, “Is the code production worthy?”
Next
ERRORS:
A programmatic error that causes the the program to fail or makes the program run for uncontrolled time frame.
EXCEPTIONS: A program/code written to handle expected or unexpected errors gracefully so that the program continues run with logging the error and bypassing the erroneous conditions or even logging the error and gracefully exiting with error message.

More detailed description will come with topic…. ‘Unhandled exceptions results in Errors’.
Note: The topic on error and exceptions is relevant to Informatica, Data Stage, Abinitio, Oracle warehouse builder, PLSQL, SQLLDR, Transact SQL or any ETL other tools.

Find me the attribute!……….. What attribute?

An attribute with a context is a meaningless attribute. (Even if it has a definition associated with it).
One of the interesting phases in ETL is the data mapping exercise. By the time data-mapping phase starts the target model and source model is already defined. But sometimes a report or excel file is pulled out by a user just like magician pulls a rabbit from the hat.  And declare it as a target. They will ask for the data to be mapped from the source. There is not target data model.
Probably the first thing will be to identify the list of attributes from the target and then off you go finding them in various source systems. Is this the right approach? Absolutely not! Will it work? Probably yes, but at the cost of time, money, resources and lots of unnecessary confusion. Why?
Lets take an example…. User wants to see qty of products sold by the Sales Representative (SR)…DATE, QUANTITY, PRICE , etc…
Is Sales Representative a person or a group of people?
Is ‘DATE’ Date of order, date of execution or date of shipping?
Is ‘QUANTITY’ qty of order, qty of execution or qty of shipping, also do we consider for qty of return?

Is ‘PRICE’ price of Order, price of actual execution or price of invoice?
Point is a floating data element is meaningless without the context of subject area or entity. The Attribute might itself be a combination of many other attributes from many other entities
So what the right approach?
Define a data model for target even if it is not an actual database. It will help you understand the entities to which it belongs. Then it will be easy to find similar entity/s in the source and map it to the target element.  

In short build a data-model. Before actually building start mapping the data.
Conclusion:  A attribute has no meaning unless it is attached to an entity.

ETL Data file processing questions, you must ask!

Many times file processing is taken quite lightly these days. The reason is either ignorance or sheer carelessness. Another reason being most of them being from RDBMS background they are just not aware of the intricacies of file management.
So wakeup….
Pre process
Before processing the file have you checked if the contents in header/trailer records actually match the file content?

Did the checksum of the file match to file on remote server?

Does the file contain delta, repeat/all rows?

Does the file contain delta, repeat/all rows? Are there flags for update, delete insert?
Does the file contain flags for update, delete insert?
Does the file transfer protocol supports support restart capabilities?

Have you asked data file envelope ( ex source name, date)information in the contents of data file or in the file name?

Have you asked for a header or trailer record? If yes is it in the data file itself or another file?

How do you know that the file is ready to be pulled?

How do you know there are no binary characters in the ASCII file?

If the file has to be pulled; What is the remote server name, IP address, user id and password?

If the file in same format is received from multiple sources how do you distinguish files from various source systems? (Content or by name or by location)

If the same file is processed in multiple schedule how do you distinguish between say a daily file or monthly file or yearly file?

If the file is being pulled from remote server then, will a backup of the file will be maintained on the remote server?

Is the remote file server being identified by IP address rather thana name? In that case, how does the system handle failover & change of server?

Is timestamp appended to the file name?

What are the contact details of the main and backup person?

What file transport protocol should be used?

What is the delimiter of file?

What is the expected size of file?

What is the file type binary or ASCII or ABCEDIC or something else?

What is the format of file multi line/XML/etc?

What is the format of the file?

What is the frequency of the file? Continues / daily / weekly bimonthly etc?

What is the location of the file?

What is the name of file?

What is the name of file? Is timestamp appended to the file name?

What is the source of the file?

What kind of access it provided ftp/secured/proprietary?

Who is the main contact person and backup contact for the file?

Will the file be overwritten daily in the source (remote) server?

Will the file be pushed or pulled?

In process

How do you know that the file that is being process is not a old file?

How will the file moved during processing?

What is the archive policy?

How should the process deal with deletes?

Can you avoid complete processing of file by taking a delta when compare to the previous days file?

Is the file compressed after load?

How do you know that all the file movement actually happened and not failed in any steps?

Did you check for the exit code after every step?

Did you convert the file columns in correct format?

What will you do if some of the columns are not in correct format?

Will you skip the row, replace the value or stop the process?

After how many fails will you terminate the process?

What happens if the load fails in between half load?

How will you identify the half loaded records?

How will you deal with the remaining records corrected manually?

Will you rerun the whole file to fix the problem?

What will happen if during the rerun the file is not completely processed?

Post process

Did the rows table match to that of file?

Does the Checksum match?

Did column shift occurred in the file?

Have you stored the result of the load into a table?

ETL Strategy for Reversing wrong file/data load.

We all know ‘garbage in, garbage out’. But what happens when the data provider (source system) realizes that they have provided incorrect data to the data warehouse? And now they want to correct the mistake by providing you new data! By then ETL has already loaded the data. The data is inserted, updated & deleted! Simple inserts is easy, but what about integrated data because of updates? How do you recover from such situation? Can you actually recover? What should be the ETL strategy?
Has your Data warehouse Architect/Manager/ETL Lead come up with an ETL Strategy?
NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

 http://www.softwaretestinghelp.com/
Test Plan Template:
(Name of the Product)
Prepared by:
(Names of Preparers)
(Date)
TABLE OF CONTENTS
1.0 INTRODUCTION
2.0 OBJECTIVES AND TASKS
2.1 Objectives
2.2 Tasks
3.0 SCOPE
4.0 Testing Strategy
4.1 Alpha Testing (Unit Testing)
4.2 System and Integration Testing
4.3 Performance and Stress Testing
4.4 User Acceptance Testing
4.5 Batch Testing
4.6 Automated Regression Testing
4.7 Beta Testing
5.0 Hardware Requirements
6.0 Environment Requirements
6.1 Main Frame
6.2 Workstation
7.0 Test Schedule
8.0 Control Procedures
9.0 Features to Be Tested
10.0 Features Not to Be Tested
11.0 Resources/Roles & Responsibilities
12.0 Schedules
13.0 Significantly Impacted Departments (SIDs)
14.0 Dependencies
15.0 Risks/Assumptions
16.0 Tools
17.0 Approvals
1.0 INTRODUCTION
A brief summary of the product being tested. Outline all the functions at a high level.
2.0 OBJECTIVES AND TASKS
2.1 Objectives
Describe the objectives supported by the Master Test Plan, eg., defining tasks and
responsibilities, vehicle for communication, document to be used as a service level
agreement, etc.
2.2 Tasks
List all tasks identified by this Test Plan, i.e., testing, post-testing, problem reporting, etc.
3.0 SCOPE
General
This section describes what is being tested, such as all the functions of a specific product,
its existing interfaces, integration of all functions.
Tactics
List here how you will accomplish the items that you have listed in the "Scope" section.
For example, if you have mentioned that you will be testing the existing interfaces, what
would be the procedures you would follow to notify the key people to represent their
respective areas, as well as allotting time in their schedule for assisting you in
accomplishing your activity?
4.0 TESTING STRATEGY
Describe the overall approach to testing. For each major group of features or feature
combinations, specify the approach which will ensure that these feature groups are
adequately tested. Specify the major activities, techniques, and tools which are used to
test the designated groups of features.
The approach should be described in sufficient detail to permit identification of the major
testing tasks and estimation of the time required to do each one.
4.1 Unit Testing
Definition:
Specify the minimum degree of comprehensiveness desired. Identify the techniques
which will be used to judge the comprehensiveness of the testing effort (for example,
determining which statements have been executed at least once). Specify any additional
completion criteria (for example, error frequency). The techniques to be used to trace
requirements should be specified.
Participants:
List the names of individuals/departments who would be responsible for Unit Testing.
Methodology:
Describe how unit testing will be conducted. Who will write the test scripts for the unit
testing, what would be the sequence of events of Unit Testing and how will the testing
activity take place?
4.2 System and Integration Testing
Definition:
List what is your understanding of System and Integration Testing for your project.
Participants:
Who will be conducting System and Integration Testing on your project? List the
individuals that will be responsible for this activity.
Methodology:
Describe how System & Integration testing will be conducted. Who will write the test
scripts for the unit testing, what would be sequence of events of System & Integration
Testing, and how will the testing activity take place?
4.3 Performance and Stress Testing
Definition:
List what is your understanding of Stress Testing for your project.
Participants:
Who will be conducting Stress Testing on your project? List the individuals that will be
responsible for this activity.
Methodology:
Describe how Performance & Stress testing will be conducted. Who will write the test
scripts for the testing, what would be sequence of events of Performance & Stress
Testing, and how will the testing activity take place?
4.4 User Acceptance Testing
Definition:
The purpose of acceptance test is to confirm that the system is ready for operational use.
During acceptance test, end-users (customers) of the system compare the system to its
initial requirements.
Participants:
Who will be responsible for User Acceptance Testing? List the individuals' names and
responsibility.
Methodology:
Describe how the User Acceptance testing will be conducted. Who will write the test
scripts for the testing, what would be sequence of events of User Acceptance Testing, and
how will the testing activity take place?
4.5 Batch Testing
4.6 Automated Regression Testing
Definition:
Regression testing is the selective retesting of a system or component to verify that
modifications have not caused unintended effects and that the system or component still
works as specified in the requirements.
Participants:
Methodology:
4.7 Beta Testing
Participants:
Methodology:
5.0 HARDWARE REQUIREMENTS
Computers
Modems
6.0 ENVIRONMENT REQUIREMENTS
6.1 Main Frame
Specify both the necessary and desired properties of the test environment. The
specification should contain the physical characteristics of the facilities, including the
hardware, the communications and system software, the mode of usage (for example,
stand-alone), and any other software or supplies needed to support the test. Also specify
the level of security which must be provided for the test facility, system software, and
proprietary components such as software, data, and hardware.
Identify special test tools needed. Identify any other testing needs (for example,
publications or office space). Identify the source of all needs which are not currently
available to your group.
6.2 Workstation
7.0 TEST SCHEDULE
Include test milestones identified in the Software Project Schedule as well as all item
transmittal events.
Define any additional test milestones needed. Estimate the time required to do each
testing task. Specify the schedule for each testing task and test milestone. For each
testing resource (that is, facilities, tools, and staff), specify its periods of use.
8.0 CONTROL PROCEDURES
Problem Reporting
Document the procedures to follow when an incident is encountered during the testing
process. If a standard form is going to be used, attach a blank copy as an "Appendix" to
the Test Plan. In the event you are using an automated incident logging system, write
those procedures in this section.
Change Requests
Document the process of modifications to the software. Identify who will sign off on the
changes and what would be the criteria for including the changes to the current product.
If the changes will affect existing programs, these modules need to be identified.
9.0 FEATURES TO BE TESTED
Identify all software features and combinations of software features that will be tested.
10.0 FEATURES NOT TO BE TESTED
Identify all features and significant combinations of features which will not be tested and
the reasons.
11.0 RESOURCES/ROLES & RESPONSIBILITIES
Specify the staff members who are involved in the test project and what their roles are
going to be (for example, Mary Brown (User) compile Test Cases for Acceptance
Testing). Identify groups responsible for managing, designing, preparing, executing, and
resolving the test activities as well as related issues. Also identify groups responsible for
providing the test environment. These groups may include developers, testers, operations
staff, testing services, etc.
12.0 SCHEDULES
Major Deliverables
Identify the deliverable documents. You can list the following documents:
- Test Plan
- Test Cases
- Test Incident Reports
- Test Summary Reports
13.0 SIGNIFICANTLY IMPACTED DEPARTMENTS (SIDs)
Department/Business Area Bus. Manager Tester(s)
14.0 DEPENDENCIES
Identify significant constraints on testing, such as test-item availability, testing-resource
availability, and deadlines.
15.0 RISKS/ASSUMPTIONS
Identify the high-risk assumptions of the test plan. Specify contingency plans for each
(for example, delay in delivery of test items might require increased night shift
scheduling to meet the delivery date).
16.0 TOOLS
List the Automation tools you are going to use. List also the Bug tracking tool here.
17.0 APPROVALS
Specify the names and titles of all persons who must approve this plan. Provide space for
the signatures and dates.
Name (In Capital Letters) Signature Date
1.
2.
3.
4.
End.
Ask me your Software Testing, Job, Interview queries at www.softwaretestinghelp.com
Passion For Testing, Passion For Quality!

5 comments:

  1. Hi, Your site is very nice. Information on etl testing is very useful. Just would like to ask you if you can copy the images again in the article. They are not actually visible now. also, it will be great if you can keep some momentum on the site so that others too will keep posting their views..

    Thanks
    Kamesh

    ReplyDelete
  2. Very useful article thank you for sharing. Know more about ETL Testing Training

    ReplyDelete
  3. fabulous information you have shared, I would like to say thanks for you. Please share more valuable content on msbi online course bangalore

    ReplyDelete