Aggregation
Data aggregation is the process where data is collected and presented in a summarized format for statistical analysis and to effectively achieve business objectives. Data aggregation is vital to data warehousing as it helps to make decisions based on vast amounts of raw data. Data aggregation provides the ability to forecast future trends and aids in predictive modelling. Effective data aggregation techniques help to minimize performance problems.
Aggregation provides more information based on related clusters of data such as an individual’s income or profession. For example, a store may want to look at the sales performance for different regions, so they would aggregate the sales data based on region.
Queries with aggregation (with mathematical functions) provide faster results. For example, the query for the sum of sales of a product in a month brings up faster results than the query for sales of the product in general. This is because the aggregation is applied on the former query and only the sum is displayed, while the latter query brings up individual records. Faster queries imply the better performance of the system.
Aggregates are used in dimensional models of the data warehouse to produce positive effects on the time it takes to query large sets of data. At the simplest form an aggregate is a simple summary table that can be derived by performing a Group by SQL query. A more common use of aggregates is to take a dimension and change the granularity of this dimension. When changing the granularity of the dimension the fact table has to be partially summarized to fit the new grain of the new dimension, thus creating new dimensional and fact tables, fitting this new level of grain. Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually precomputed, partially summarized data, that are stored in new aggregated tables. When facts are aggregated, it is either done by eliminating dimensionality or by associating the facts with a rolled up dimension. Rolled up dimensions should be shrunken versions of the dimensions associated with the granular base facts. This way, the aggregated dimension tables should conform to the base dimension tables. So the reason why aggregates can make such a dramatic increase in the performance of the data warehouse is the reduction of the number of rows to be accessed when responding to a query.
Types:
Time aggregation: It provides the data point for single resources for a defined time period.
Spatial aggregation: It provided the data point for a group of resources for a defined time period.
Query Facility
Sum: Adds together all the specified data to get a total.
Average: Computes the average value of the specific data.
Max: Displays the highest value for each category.
Min: Displays the lowest value for each category.
Count: Counts the total number of data entries for each category.
Data can also be aggregated by date, allowing trends to be shown over a period of years, quarters, months, etc. These aggregations could be placed in a hierarchy, where you can view the data trends over a period of years, then see the data trends over months for each individual year.
OLAP function and Tools
OLAP databases uses a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time and also includes features of navigational databases, hierarchical databases and relational databases. OLAP system consists of an OLAP cube, which is also known as multidimensional cube or a hypercube.
OLAP cube consists of numeric facts called measures which are categorized by dimensions. The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space. The cube metadata is typically created from a star schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.