Data Warehousing

 Data Warehousing Overview


Data warehousing is the process of integrating enterprise-wide corporate data into a single repository. The resulting data warehouse may then support a variety of decision analysis functions as well as strategic operational functions.

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

According to Bill Inmon:
A Warehouse is a Subject-oriented, integrated, time-variant and non-Volatile collection of data in support of management's decision making process.
  • Subject Oriented: Data that represents a particular subject Area like sales, Mktg etc instead of a company's ongoing operations.
  • Integrated: Data that is collected from multiple source systems integrated into a user readable unique format. Ex: male, female, 0, 1, M, F.
  • Non Volatile: The data stores historical, but the data is never removed.
  • Time Variant: The data stores in time wise like weekly, monthly, quarterly, yearly.

The approaches in constructing a Datawarehouse and the DataMart:
  1. Top-down approaches: In top-down approach to data warehouse design, in which the data warehouse is using a normalized enterprise data model. 
  2. Bottom-up approaches: In the bottom-up approach data marts are first created to provide reporting and analytical capabilities for specific business processes.
In term of design data warehousing and data mart are almost same.


In general a Datawarehousing is used on enterprise level and DataMart is used on a business unit/department level. 

Data Mart

A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of business line.

The Data Mart is a subset of the Data Warehouse.

Dimensional Model

What is dimensional modelling?
Dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from dimension tables that qualifies the data. The goal of Dimensional model is not to achieve high degree of normalization but to facilitate easy and faster data retrieval.

Ralph Kimball is one of the strongest proponents of this very popular data modelling technique which is often used in many enterprise level data warehouses

Benefits of Dimensional Modelling
1.     Faster Data Retrieval
2.     Better Understand-ability
3.     Extensibility

In dimensional model, everything is divided in 2 distinct categories - dimension or measures. Anything we try to model, must fit in one of these two categories.


A type of data modelling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables.

Dimensional table records information on each dimension and Fact table records all the "Fact" or "Measures".

Data Warehouse Schema Types

Star Schema:
Star Schema is a simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantage of star schema are slicing down, performance increase and easy understanding of data.

Note: Star Schema is a de-normalized form

Benefit of Start Schema: 
Star schema is probably most popular schema in dimensional modelling because of its simplicity and flexibility. In a Star schema design, any information can be obtained just by traversing a single join, which means this type of schema will be ideal for information retrieval (faster query processing). In this all hierarchies (or levels) of the members of a dimension are stored in the single dimension table. 

Snowflake Schema:
A snowflake schema is similar to a star schema structure but normalized through the use of outrigger tables. Here one or more dimension tables are connected with other dimension table as well as with the central fact table.

Note: Star Schema is a normalized form

Benefit of Snowflake Schema: 
In OLAP, this Snowflake schema approach increases the number of joins and poor performance in retrieval of data. In few organizations, they try to normalize the dimension tables to save space. Since dimension tables hold less space, Snowflake schema approach may be avoided.

Fact Table:
The centralized table in a star schema is called as FACT table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.

Fact table can be added across several dimensions. Fact tables store different types of measures like additive, non additive and semi additive measures.

A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). 

Fact table that contains no measures or facts. These tables are called as Factless Fact tables.

Measure Types:
  • Additive - Measures that can be added across all dimensions. 
  • Non Additive - Measures that cannot be added across all dimensions. 
  • Semi Additive - Measures that can be added across few dimensions and not with others

No comments:

Post a Comment