A star schema is a popular data modeling technique used in data warehousing. It is a simple and efficient way to represent data for reporting and analysis. In a star schema, data is organized into a fact table and multiple dimension tables. The fact table contains the measures or metrics, while the dimension tables contain the attributes or context in which the measures are calculated. The schema is named after its resemblance to a star shape, with the fact table at the center and the dimension tables radiating outwards like star points.
Features of a star schema:
- Fact table: The fact table contains the measures or metrics of interest, such as sales revenue, units sold, or customer count. Each row in the fact table represents a unique combination of dimension keys and measures. The fact table is usually the largest table in the schema, and it is connected to the dimension tables through foreign keys.
- Dimension tables: Dimension tables contain the attributes or context in which the measures are calculated. For example, a product dimension table might contain attributes such as product category, brand, and SKU, while a time dimension table might contain attributes such as date, month, and quarter. Each dimension table is usually smaller than the fact table, and it is connected to the fact table through foreign keys.
- Snowflake schema: In some cases, a star schema may be further normalized to reduce redundancy and improve efficiency. This is known as a snowflake schema, where some of the dimension tables are split into smaller tables. For example, a product dimension table might be split into separate tables for product category, brand, and SKU.
Here’s an example of a star schema for a retail company:
Fact table: Sales
- Date_key
- Product_key
- Store_key
- Sales_amount
- Units_sold
Dimension tables:
- Date
- Date_key
- Date
- Day_of_week
- Month
- Quarter
- Year
- Product
- Product_key
- Product_category
- Brand
- SKU
- Store
- Store_key
- Store_name
- City
- State
- Country
In this schema, the fact table is Sales, which contains the measures of sales_amount and units_sold. The fact table is connected to the dimension tables Date, Product, and Store through their respective foreign keys. Each dimension table contains the attributes that describe the dimension.
With this star schema in place, the retail company can easily analyze their sales data by different dimensions, such as time, product, and store. For example, they can analyze sales revenue by product category, brand, and store location, or track sales trends over time and geography. This information can be used to make data-driven decisions about inventory management, marketing strategies, and sales promotions.
Type of Star Schemas
- Simple Star Schema: A simple star schema has one fact table and a few dimension tables. It is the most basic type of star schema and is easy to understand and implement. For example, consider the following simple star schema for a sales database:
Fact Table:
- Sales
- Date key
- Product key
- Customer key
- Quantity
- Price
- Total Sales
Dimension Tables:
- Date
- Date key
- Date
- Day of week
- Month
- Quarter
- Year
- Product
- Product key
- Product name
- Product category
- Customer
- Customer key
- Customer name
- Customer address
In this schema, the sales fact table is connected to three dimension tables: date, product, and customer.
- Complex Star Schema: A complex star schema has multiple fact tables and many dimension tables. This type of schema is used when there are different types of data that need to be analyzed together. For example, consider the following complex star schema for a healthcare database:
Fact Tables:
- Hospital Admissions
- Admission ID
- Patient ID
- Date of Admission
- Date of Discharge
- Diagnosis
- Procedure
- Doctor ID
- Nurse ID
- Lab Results
- Result ID
- Patient ID
- Test Name
- Test Date
- Test Result
Dimension Tables:
- Patients
- Patient ID
- Patient Name
- Age
- Gender
- Doctors
- Doctor ID
- Doctor Name
- Specialty
- Nurses
- Nurse ID
- Nurse Name
- Shift
- Tests
- Test Name
- Test Category
In this schema, the hospital admissions and lab results fact tables are connected to four-dimension tables: patients, doctors, nurses, and tests.
- Snowflake Schema: A snowflake schema is a variation of a star schema where dimension tables are normalized into multiple related tables. This helps to reduce redundancy and improve efficiency, but it can make the schema more complex. For example, consider the following snowflake schema for a retail database:
Fact Table:
- Sales
- Date key
- Product key
- Store key
- Sales Amount
- Units Sold
Dimension Tables:
- Date
- Date key
- Date
- Day of Week
- Month
- Quarter
- Year
- Product
- Product key
- Product name
- Product category
- Supplier key
- Supplier
- Supplier key
- Supplier name
- Supplier address
- Store
- Store key
- Store name
- Store location
- Region key
- Region
- Region key
- Region name
- Region manager
In this schema, the product and store dimension tables are normalized into supplier and region tables, respectively.
- Galaxy Schema: A galaxy schema is a type of star schema that has multiple fact tables that are connected to multiple dimension tables. This type of schema is used when there are many different types of data that need to be analyzed together. For example, consider the following galaxy schema for a sports database:
Fact Tables:
- Game Results
- Game ID
- Home Team ID
- Away Team ID
- Home Team Score
- Away Team Score
- Game Date
- Player Stats
- Game ID
- Player ID
- Points
Dimension Tables:
- Teams
- Team ID
- Team Name
- Team City
- Team Conference
- Team Division
- Players
- Player ID
- Player Name
- Player Position
- Player Height
- Player Weight
- Games
- Game ID
- Game Date
- Game Location
- Game Time
- Conferences
- Conference ID
- Conference Name
- Divisions
- Division ID
- Division Name
In this schema, the game results and player stats fact tables are connected to four-dimension tables: teams, players, games, and conferences/divisions. This allows for analysis of both team and player performance across multiple games and seasons, as well as comparisons between teams in different conferences and divisions.