Concept Hierarchy
A concept hierarchy represents a series of mappings from a set of low-level concepts to larger-level, more general concepts. Concept hierarchy organizes information or concepts in a hierarchical structure or a specific partial order, which are used for defining knowledge in brief, high-level methods, and creating possible mining knowledge at several levels of abstraction.
A conceptual hierarchy includes a set of nodes organized in a tree, where the nodes define values of an attribute known as concepts. A specific node, “ANY”, is constrained for the root of the tree. A number is created to the level of each node in a conceptual hierarchy. The level of the root node is one. The level of a non-root node is one more the level of its parent level number.
Because values are defined by nodes, the levels of nodes can also be used to describe the levels of values. Concept hierarchy enables raw information to be managed at a higher and more generalized level of abstraction.
There are several types of concept hierarchies which are as follows:
Set-Grouping Hierarchy: A set-grouping hierarchy constructs values for a given attribute or dimension into groups or constant range values. It is also known as instance hierarchy because the partial series of the hierarchy is represented on the set of instances or values of an attribute. These hierarchies have more functional sense and are so approved than other hierarchies.
Schema Hierarchy: Schema hierarchy represents the total or partial order between attributes in the database. It can define existing semantic relationships between attributes. In a database, more than one schema hierarchy can be generated by using multiple sequences and grouping of attributes.
Operation-Derived Hierarchy: Operation-derived hierarchy is represented by a set of operations on the data. These operations are defined by users, professionals, or the data mining system. These hierarchies are usually represented for mathematical attributes. Such operations can be as easy as range value comparison, as difficult as a data clustering and data distribution analysis algorithm.
Rule-based Hierarchy: In a rule-based hierarchy either a whole concept hierarchy or an allocation of it is represented by a set of rules and is computed dynamically based on the current information and rule definition. A lattice-like architecture is used for graphically defining this type of hierarchy, in which each child-parent route is connected with a generalization rule.
3 Tier Architecture
Data Warehouses usually have a three-level (tier) architecture that includes:
- Bottom Tier (Data Warehouse Server)
- Middle Tier (OLAP Server)
- Top Tier (Front end Tools).
A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by external consultants) are extracted using application program interfaces called a gateway. A gateway is provided by the underlying DBMS and allows customer programs to generate SQL code to be executed at a server.
A middle-tier which consists of an OLAP server for fast querying of the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a particular purpose server that directly implements multidimensional information and operations.
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data.
ETL
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and then finally, loads it into the Data Warehouse system.
Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems is extracted which can be in various formats like relational databases, No SQL, XML, and flat files into the staging area. It is important to extract the data from various source systems and store it into the staging area first and not directly into the data warehouse because the extracted data is in various formats and can be corrupted also. Hence loading it directly into the data warehouse may damage it and rollback will be much more difficult. Therefore, this is one of the most important steps of ETL process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions are applied on the extracted data to convert it into a single standard format. It may involve following processes/tasks:
- Filtering: Loading only certain attributes into the data warehouse.
- Cleaning: Filling up the NULL values with some default values, mapping U.S.A, United States, and America into USA, etc.
- Joining: Joining multiple attributes into one.
- Splitting: Splitting a single attribute into multiple attributes.
- Sorting: Sorting tuples on the basis of some attribute (generally key-attribute).
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is updated by loading into the data warehouse very frequently and sometimes it is done after longer but regular intervals. The rate and period of loading solely depends on the requirements and varies from system to system.
Data Marting
A Data Mart is focused on a single functional area of an organization and contains a subset of data stored in a Data Warehouse. A Data Mart is a condensed version of Data Warehouse and is designed for use by a specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR or finance. It is often controlled by a single department in an organization.
Data Mart usually draws data from only a few sources compared to a Data warehouse. Data marts are small in size and are more flexible compared to a Datawarehouse.
There are three main types of data mart:
Dependent: Dependent data marts are created by drawing data directly from operational, external or both sources.
Independent: Independent data mart is created without the use of a central data warehouse.
Hybrid: This type of data marts can take data from data warehouses or operational systems.