Total Pageviews

Wednesday, 29 February 2012

ETL Tesing 1

ETL testing :
Testing is an important phase in the project life cycle.The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, as extracting data correctly will set the stage for how subsequent processes will go.

Extract, transform and load (ETL) is a process in database usage and especially in data warehousing that involves:

Data warehousing and its Concepts:

What is Data warehouse?
Data Warehouse is a central managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM, ERP system). It may gather manual inputs from users determining criteria and parameters for grouping or classifying records.
Data warehouse database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time, with minimum disruption to operational systems. It is ensured by a strategy implemented    in            ETL         process.
A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated and it becomes ready to be loaded into the data warehouse.
Data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed in the time variant.
Sometimes, where only a portion of detailed data is required, it may be worth considering using a data mart.
A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.
 
Advantages of Data warehouse:
Ø  Data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
Ø  Inconsistencies are identified and resolved prior to loading of data in the Data warehouse. This greatly simplifies reporting and analysis.
Ø  Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
Ø  Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
Ø   Data warehouses enhance the value of operational business applications, notably customer relationship management (CRM) systems.
Ø   Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
Disadvantages of Data Warehouse:
Ø  Data warehouses are not the optimal environment for unstructured data.
Ø  Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
Ø  Over their life, data warehouses can have high costs. Maintenance costs are high.
Ø  Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
Ø  There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems and vice versa.

ETL Concept:

ETL is the automated and auditable data acquisition process from source system that involves one or more sub processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.

E  - Extracting data from source operational or archive systems which are primary source of data for the data warehouse.
T - Transforming the data – which may involve cleaning, filtering, validating and applying business rules.
L  - Loading the data into the data warehouse or any other database or application that houses the data.

Extraction:

The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as through web spidering or screen-scraping. Extraction converts the data into a format for transformation processing.
An intrinsic part of the extraction involves the parsing of extracted data, resulting in a check if the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.

Transformation:

Transformation is the series of tasks that prepares the data for loading into the warehouse. Once data is secured, you have worry about its format or structure. Because it will be not be in the format needed for the target. Example the grain level, data type, might be different. Data cannot be used as it is. Some rules and functions need to be applied to transform the data
One of the purposes of ETL is to consolidate the data in a central repository or to bring it at one logical or physical place. Data can be consolidated from similar systems, different subject areas, etc.
ETL must support data integration for the data coming from multiple sources and data coming at different times. This has to be seamless operation. This will avoid overwriting existing data, creating duplicate data or even worst simply unable to load the data in the target

Loading:

Loading process is critical to integration and consolidation. Loading process decides the modality of how the data is added in the warehouse or simply rejected. Methods like addition, Updating or deleting are executed at this step. What happens to the existing data?  Should the old data be deleted because of new information? Or should the data be archived? Should the data be treated as additional data to the existing one?
So data to the data warehouse has to loaded with utmost care for which data auditing process can only establish the confidence level. This auditing process normally happens after the loading of data.

List of ETL tools:

Below is the list of ETL Tools available in the market:

List of ETL Tools
ETL Vendors
Oracle Warehouse Builder (OWB)
Oracle 
Data Integrator & Data Services 
SAP Business Objects
IBM Information Server (Datastage)
IBM
SAS Data Integration Studio
SAS Institute
PowerCenter
Informatica 
Elixir Repertoire
Elixir
Data Migrator
Information Builders
SQL Server Integration Services
Microsoft 
Talend Open Studio
Talend
DataFlow Manager
Pitney Bowes Business Insight
Data Integrator
Pervasive
Open Text Integration Center
Open Text
Transformation Manager
ETL Solutions Ltd.
Data Manager/Decision Stream
IBM (Cognos)
Clover ETL
Javlin 
ETL4ALL
IKAN
DB2 Warehouse Edition
IBM
Pentaho Data Integration
Pentaho 
Adeptia Integration Server
Adeptia
ETL Testing:

Following are some common goals for testing an ETL application:


Data completeness - To ensure that all expected data is loaded.

Data Quality - It promises that the ETL application correctly rejects, substitutes default values, corrects and reports invalid data.

Data transformation - This is meant for ensuring that all data is correctly transformed according to business rules and design specifications.
Performance and scalability- This is to ensure that the data loads and queries perform within expected time frames and the technical architecture is scalable.
Integration testing- It is to ensure that ETL process functions well with other upstream and downstream applications.

User-acceptance testing - It ensures the solution fulfills the users’ current expectations and also anticipates their future expectations.

Regression testing - To keep the existing functionality intact each time a new release of code is completed.

Basically data warehouse testing is divided into two categories ‘Back-end testing’ and ‘Front-end testing’. The former applies where the source systems data is compared to the end-result data in Loaded area which is the ETL testing. While the latter refers to where the user checks the data by comparing their MIS with the data that is displayed by the end-user tools.


Data Validation:
Data completeness is one of the basic ways for data validation. This is needed to verify that all expected data loads into the data warehouse. This includes the validation of all the records, fields and ensures that the full contents of each field are loaded.


Data Transformation:
Validating that the data is transformed correctly based on business rules, can be one of the most complex parts of testing an ETL application with significant transformation logic. Another way of testing is to pick up some sample records and compare them for validating data transformation manually, but this method requires manual testing steps and testers who have a good amount of experience and understand of the ETL logic.

Data Warehouse Testing Life Cycle:

Like any other piece of software a DW implementation undergoes the natural cycle of Unit testing, System testing, Regression testing, Integration testing and Acceptance testing.

Unit testing: Traditionally this has been the task of the developer. This is a white-box testing to ensure the module or component is coded as per agreed upon design specifications. The developer should focus on the following:

a) That all inbound and outbound directory structures are created properly with appropriate permissions and sufficient disk space. All tables used during the ETL are present with necessary privileges.

b) The ETL routines give expected results:
i. All transformation logics work as designed from source till target
ii. Boundary conditions are satisfied− e.g. check for date fields with leap year dates
iii. Surrogate keys have been generated properly
iv. NULL values have been populated where expected
v. Rejects have occurred where expected and log for rejects is created with sufficient details
vi. Error recovery methods
vii. Auditing is done properly

c) That the data loaded into the target is complete:
i. All source data that is expected to get loaded into target, actually get loaded− compare counts between source and target and use data profiling tools
ii. All fields are loaded with full contents− i.e. no data field is truncated while transforming
iii. No duplicates are loaded
iv. Aggregations take place in the target properly
v. Data integrity constraints are properly taken care of

System testing: Generally the QA team owns this responsibility. For them the design document is the bible and the entire set of test cases is directly based upon it. Here we test for the functionality of the application and mostly it is black-box. The major challenge here is preparation of test data. An intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible use production-like data. You may also use data generation tools or customized tools of your own to create test data. We must test for all possible combinations of input and specifically check out the errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage since we must be able to interpret the results functionally and not just code-wise.
The QA team must test for:

i.                     Data completeness− match source to target counts terms of business. Also the load windows refresh period for the DW and the views created should be signed off from users.
ii.                   Data aggregations− match aggregated data against staging tables.
iii.                  Granularity of data is as per specifications.
iv.                 Error logs and audit tables are generated and populated properly.
v.                   Notifications to IT and/or business are generated in proper format

Regression testing: A DW application is not a one-time solution. Possibly it is the best example of an incremental design where requirements are enhanced and refined quite often based on business needs and feedbacks. In such a situation it is very critical to test that the existing functionalities of a DW application are not messed up whenever an enhancement is made to it. Generally this is done by running all functional tests for existing code whenever a new piece of code is introduced. However, a better strategy could be to preserve earlier test input data and result sets and running the same again. Now the new results could be compared against the older ones to ensure proper functionality.

Integration testing: This is done to ensure that the application developed works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows. We need to ensure for data integrity across the flow. Our test strategy should include testing for:

i. Sequence of jobs to be executed with job dependencies and scheduling
ii. Re-startability of jobs in case of failures
iii. Generation of error logs
iv. Cleanup scripts for the environment including database

This activity is a combined responsibility and participation of experts from all related applications is a must in order to avoid misinterpretation of results.

Acceptance testing: This is the most critical part because here the actual users validate your output datasets. They are the best judges to ensure that the application works as expected by them. However, business users may not have proper ETL knowledge. Hence, the development and test team should be ready to provide answers regarding ETL process that relate to data population. The test team must have sufficient business knowledge to translate the results in terms of business. Also the load windows, refresh period for the DW and the views created should be signed off from users.

Performance testing: In addition to the above tests a DW must necessarily go through another phase called performance testing. Any DW application is designed to be scalable and robust. Therefore, when it goes into production environment, it should not cause performance problems. Here, we must test the system with huge volume of data. We must ensure that the load window is met even under such volumes. This phase should involve DBA team, and ETL expert and others who can review and validate your code for optimization.


Summary:

Testing a DW application should be done with a sense of utmost responsibility. A bug in a DW traced at a later stage results in unpredictable losses. And the task is even more difficult in the absence of any single end-to-end testing tool. So the strategies for testing should be methodically developed, refined and streamlined. This is also true since the requirements of a DW are often dynamically changing. Under such circumstances repeated discussions with development team and users is of utmost importance to the test team. Another area of concern is test coverage. This has to be reviewed multiple times to ensure completeness of testing. Always remember, a DW tester must go an extra mile to ensure near defect free solutions.


Defect Severity determines the defect's effect on the application where as
Defect Priority determines the defect urgency of repair.

Severity is given by Testers and Priority by Developers

1. High Severity & Low Priority : For example an application which generates some banking related reports weekly, monthly, quarterly & yearly by doing some calculations. If there is a fault while calculating yearly report. This is a high severity fault but low priority because this fault can be fixed in the next release as a change request.

2. High Severity & High Priority : In the above example if there is a fault while calculating weekly report. This is a high severity and high priority fault because this fault will block the functionality of the application immediately within a week. It should be fixed urgently.

3. Low Severity & High Priority : If there is a spelling mistake or content issue on the homepage of a website which has daily hits of lakhs. In this case, though this fault is not affecting the website or other functionalities but considering the status and popularity of the website in the competitive market it is a high priority fault.

4. Low Severity & Low Priority : If there is a spelling mistake on the pages which has very less hits throughout the month on any website. This fault can be considered as low severity and low priority.

ETL TESTING:
ETL basically stands for Extract Transform Load - which simply implies the process where you extract data from Source Tables, transform them in to the desired format based on certain rules and finally load them onto Target tables. There are numerous tools that help you with ETL process - Informatica, Control-M being a few notable ones.

So ETL Testing implies - Testing this entire process using a tool or at table level with the help of test cases and Rules Mapping document.

In ETL Testing, the following are validated -
1) Data File loads from Source system on to Source Tables.
2) The ETL Job that is designed to extract data from Source tables and then move them to staging tables. (Transform process)
3) Data validation within the Staging tables to check all Mapping Rules / Transformation Rules are followed.
4) Data Validation within Target tables to ensure data is present in required format and there is no data loss from Source to Target tables.

Challenges of Data warehouse Testing
•                  Data selection from multiple source systems and
analysis that follows pose great challenge.
•                  Volume and the complexity of the data.
•                  Inconsistent and redundant data in a data warehouse.
•                  Inconsistent and Inaccurate reports.
•                  Non-availability of History data.
3                  Testing Methodology
•                  Use of Traceability to enable full test coverage of
Business Requirements
•                  In depth review of Test Cases
•                  Manipulation of Test Data to ensure full test
coverage

Fig 1 Testing Methodology (V- Model)
•                  Provision of appropriate tools to speed the process
of Test Execution & Evaluation
•                  Regression Testing
4                  Testing Types
The following are types of Testing performed for Data
warehousing projects.
1.                 Unit Testing.
2.                 Integration Testing.
3.                 Technical Shakedown Testing.
4.                 System Testing.
5.                 Operation readiness Testing
6.                 User Acceptance Testing.
4.1              Unit Testing
The objective of Unit testing involves testing of Business
transformation rules, error conditions, mapping fields at
staging and core levels.
Unit testing involves the following
1.                 Check the Mapping of fields present in staging
level.
2.                 Check for the duplication of values generated using
Sequence generator.
3.                 Check for the correctness of surrogate keys, which
uniquely identifies rows in database.
4.                 Check for Data type constraints of the fields
present in staging and core levels.
5.                 Check for the population of status and error
messages into target table.
6.                 Check for string columns are left and right trimmed.
7.                 Check every mapping needs to implement the process
abort mapplet which is invoked if the number of record read
from source is not equal to trailer count.
8.                 Check every object, transformation, source and
target need to have proper metadata. Check visually in data
warehouse designer tool if every transformation has a
meaningful description.
4.2              Integration Testing
The objective of Integration Testing is to ensure that
workflows are executed   as scheduled with correct
dependency.
Integration testing involves the following
1.                 To check for the execution of workflows at the
following stages
      Source to Staging  A.
Staging A to Staging  B.
      Staging B to Core.
2.                 To check target tables are populated with correct
number of records.
3.                 Performance of the schedule is recorded and
analysis is performed   
                on the performance result.
4.                 To verify the dependencies among workflows between
source to staging, staging to staging and staging to core
is have been properly defined.
5.                 To Check for Error log messages in appropriate file.
6.                 To verify if the start jobs starts at pre-defined
starting time. Example if the start time for first job has
been configured to be at 10:00AM and the Control-M group
has been ordered at 7AM, the first job would not start in
Control-M until 10:00AM.
7.                 To check for restarting of Jobs in case of failures.
4.3              Technical Shakedown Test
Due to the complexity in integrating the various source
systems and tools, there are expected to be several
teething problems with the environments. A Technical
Shakedown Test will be conducted prior to commencing System
Testing, Stress & Performance, User Acceptance testing and
Operational Readiness Test to ensure the following points
are proven:
•                  Hardware is in place and has been configured
correctly (including Informatica architecture, Source
system connectivity and Business Objects).
•                  All software has been migrated to the testing
environments correctly.
•                  All required connectivity between systems are in
place.
•                  End-to-end transactions (both online and batch
transactions) have been executed and do not fall over.
4.4              System Testing
The objective of System Testing is to ensure that the
required business functions are implemented correctly. This
phase includes data verification which tests the quality of
data populated into target tables.
 System Testing involves the following
1.                 To check the functionality of the system meets the
business specifications.
2.                 To check for the count of records in source table
and comparing with the number of records in the target
table followed by analysis of rejected records.
3.                 To check for end to end integration of systems and
connectivity of the infrastructure (e.g. hardware and
network configurations are correct),
4.                 To check all transactions, database updates and
data flows functions for accuracy.
5.                 To validate Business reports functionality.

Reporting functionality               Ability to report data as required
by Business using Business Objects
Report Structure       Since the universe and reports have
been migrated from previous version of Business Objects,
it’s necessary to ensure that the upgraded reports
replicate the structure/format and data requirements (until
and unless a change / enhancement has been documented in
Requirement Traceability Matrix / Functional Design
Document).
Enhancements          Enhancements like reports’ structure,
prompts ordering which were in scope of upgrade project
will be tested
Data Accuracy            The data displayed in the reports / prompts
matches with the actual data in data mart.
Performance              Ability of the system to perform certain
functions within a prescribed time.
That the system meets the stated performance criteria
according to agreed SLAs or specific non-functional
requirements.
Security    That the required level of security access
is controlled and works properly, including domain
security, profile security, Data Security, UserID and
password control, and access procedures. That the security
system cannot be bypassed.
Usability   That the system is useable as per specified
requirements
User Accessibility      That specified type of access to
data is provided to users
Connection Parameters             Test the connection
Data provider             Check for the right universe and duplicate
data
Conditions/Selection criteria    Test the for selection
criteria for the correct logic
Object testing            Test the objects definitions
Context testing         Ensure formula is with input or output
context
Variable testing         Test the variable for its syntax
and data type compatible
Formulas or calculations             Test the formula for its
syntax and validate the data given by the formula
Filters        Test the data has filter correctly
Alerts        Check for extreme limits Report alerts
Sorting      Test the sorting order of Section headers fields,
blocks
Totals and subtotals                     Validate the data results
Universe Structure  Integrity of universe is maintained
and there are no divergences in terms of joins / objects /
prompts
4.5              User Acceptance Testing
The objective of this testing to ensure that System meets
the expectations of the business users.  It aims to prove
that the entire system operates effectively in a production
environment and that the system successfully supports the
business processes from a user's perspective. Essentially,
these tests will run through “a day in the life of”
business users.  The tests will also include functions that
involve source systems connectivity, jobs scheduling and
Business reports functionality.
4.6              Operational Readiness Testing (ORT)
This is the final phase of testing which focuses on
verifying the deployment of software and the operational
readiness of the application. The main areas of testing in
this phase include:
 Deployment Test
1.                 Tests the deployment of the solution
2.                 Tests overall technical deployment “checklist” and
timeframes
3.                 Tests the security aspects of the system including
user authentication and   authorization, and user-access
levels.
 Operational and Business Acceptance Testing
1.                 Tests the operability of the system including job
control and scheduling.
2.                 Tests include normal scenarios, abnormal, and fatal
scenarios
5                  Test Data
Given the complexity of Data warehouse projects;
preparation of test data is daunting task. Volume of data
required for each level of testing is given below.
Unit Testing - This phase of testing will be performed with
a small subset (20%) of production data for each source
system.
Integration Testing - This phase of testing will be
performed with a small subset of production data for each
source system.
System Testing – This phase of a subset of live data will
be used which is sufficient in volume to contain all
required test conditions that includes normal scenarios,
abnormal, and fatal scenarios but small enough that
workflow execution time does not impact the test schedule
unduly.
6                  Conclusion
Data warehouse solutions are becoming almost ubiquitous as
a supporting technology for the operational and strategic
functions at most companies. Data warehouses play an
integral role in business functions as diverse as
enterprise process management and monitoring, and
production of financial statements. The approach described
here combines an understanding of the business rules
applied to the data with the ability to develop and use
testing procedures that check the accuracy of entire data
sets. This level of testing rigor requires additional
effort and more skilled resources. However, by employing
this methodology, the team can be more confident, from day
one of the implementation of the DW, in the quality of the
data. This will build the confidence of the end-user
community, and it will ultimately lead to a more effective
implementation.


Please go through the above document.As far as my knowledge
we don't have any ETL Testing Certifications.

ETL testing
You can do that by comparing the source vs warehouse tables in SQL Server you can use like this

source.table1
except
warehouse.table1
Union
Warehouse.table
except
source.table1


Why do you do the above union is you are taking source table as the base table and compare with warehouse then take the warehouse as source and compare that with source table then you get the differences.
ETL testing
ETL testing validates that data is transformed correctly from OLTP to data
warehouse.

Validating the data transformed includes following main verification points:


  1. Ensures that all expected data is loaded. Comparing record counts between
    source data, data loaded to the warehouse and rejected records.

  1. Ensures that all data is transformed correctly according to design
    specifications.

  1. Ensures that the ETL application substitutes default values, ignores
    invalid data.

  1. Validate correct processing of ETL-generated fields such as surrogate
    keys.

  1. Validate that data types in the warehouse are as specified in the design
    and/or the data model.

  1. Validate the referential integrity between tables.
  




12 comments:

  1. Hi

    really u provided valuable information for etl testing.thanks(sulthanabad,karimnagar,Andhrapradesh,india>

    ReplyDelete
  2. hey buddy !
    It will good if you can share ETL testing real challenges faced day to day activity. like data validation from source to target using sql query,defects found during data validation. type of testing performed in staging to dwh, etc etc

    ReplyDelete
  3. Thank you for the nice article here. Really nice and keep update to explore more ideas.

    Big Data Testing Services

    ReplyDelete
  4. Deep machine learning is the way to understand machine learning in a better way. By this, you can verify machine learning solution providers' quality in many different terms. You need to be alert and updated to avoid any kind of misleading.

    ReplyDelete
  5. Nice information, this is will helpfull a lot, Thank for sharing, Keep do posting i like to follow this informatica online training

    ReplyDelete
  6. Thanks your blog and very informative. Keep posting.
    etl testing course
    etl testing

    ReplyDelete
  7. IntelliMindz is the best IT Training in Bangalore with placement, offering 200 and more software courses with 100% Placement Assistance.

    ETL Testing Online Training

    ETL Testing Course In Bangalore

    ETL Testing Course In Chennai

    ETL Testing Course In Coimbatore

    ETL Testing Training In Tirupur

    ReplyDelete
  8. IT's very informative blog and useful article thank you for sharing with us , keep posting learn more about Product engineering services | Product engineering solutions.

    ReplyDelete
  9. Good article on ETL testing. Thank you for sharing this.
    QA Career Portal
    ETL Testing Jobs

    ReplyDelete