Tuning Data Warehouse
A data warehouse keeps evolving and it is unpredictable what query the user is going to post in the future. Therefore, it becomes more difficult to tune a data warehouse system.
Difficulties in Data Warehouse Tuning
Tuning a data warehouse is a difficult procedure due to following reasons:
- Data warehouse is dynamic; it never remains constant.
- It is very difficult to predict what query the user is going to post in the future.
- Business requirements change with time.
- Users and their profiles keep changing.
- The user can switch from one group to another.
- The data load on the warehouse also changes with time.
Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until data load is complete. This is the entry point into the system.
Note: If there is a delay in transferring the data, or in arrival of data then the entire system is affected badly. Therefore, it is very important to tune the data load first.
There are various approaches of tuning data load:
The very common approach is to insert data using the SQL Layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, the code will run to check for enough space to insert the data. If sufficient space is not available, then more space may have to be allocated to these tables. These checks take time to perform and are costly to CPU.
The second approach is to bypass all these checks and constraints and place the data directly into the preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data. This can lead to some space wastage.
The third approach is that while loading the data into the table that already contains the table, we can maintain indexes.
The fourth approach says that to load the data in tables that already contain data, drop the indexes & recreate them when the data load is complete. The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.
- Fixed queries
- Ad hoc queries
Fixed queries are well defined. Following are the examples of fixed queries −
- Regular reports
- Canned queries
- Common aggregations
Tuning the fixed queries in a data warehouse is same as in a relational database system. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these executing plan will allow us to spot changing data size and data skew, as it will cause the execution plan to change.
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For each user or group of users, you need to know the following:
- The number of users in the group
- Whether they use ad hoc queries at regular intervals of time
- Whether they use ad hoc queries frequently
- Whether they use ad hoc queries occasionally at unknown intervals.
- The maximum size of query they tend to run
- The average size of query they tend to run
- Whether they require drill-down access to the base data
- The elapsed login time per day
- The peak time of daily usage
- The number of queries they run per peak hour
Testing Data Warehouse
Testing is very important for data warehouse systems for data validation and to make them work correctly and efficiently.
There are three basic levels of testing performed on data warehouse which are as follows:
This type of testing is being performed at the developer’s end. In unit testing, each unit/component of modules is separately tested. Each modules of the whole data warehouse, i.e. program, SQL Script, procedure, Unix shell is validated and tested.
In this type of testing the various individual units/ modules of the application are brought together or combined and then tested against the number of inputs. It is performed to detect the fault in integrated modules and to test whether the various components are performing well after integration.
System testing is the form of testing that validates and tests the whole data warehouse application. This type of testing is being performed by technical testing team. This test is conducted after developer’s team performs unit testing and the main purpose of this testing is to check whether the entire system is working altogether or not.
Challenges of data warehouse testing are:
- Data selection from multiple source and analysis that follows pose great challenge.
- Volume and complexity of the data, certain testing strategies are time consuming.
- ETL testing requires hive SQL skills, thus it pose challenges for tester who have limited SQL skills.
- Redundant data in a data warehouse.
- Inconsistent and inaccurate reports.
ETL testing is performed in five stages:
- Identifying data sources and requirements.
- Data acquisition.
- Implement business logic’s and dimensional modelling.
- Build and populate data.
- Build reports.