Four Approaches to Data Warehousing

The requirements of a Data Warehouse solution continually change during its life time. For some organizations you’ll typically expect more changes than others. Think of maturity, history of organizational changes: mergers, acquisitions etc.

You should adopt a solution type that will yield the best value taking these things into account.

Data Warehouse Solutions

When confronted with a new data warehouse solution that I am going to design or redesign I will typically choose between one of the following four approaches:

One-Shot Data Warehouse
Keeping History in Dimensions
Historical / Persistent Staging Area
Data Vault

In this blog post I’ll briefly dive into these approaches and discuss some of the implementation issues.

The choice for a specific approach depends on a number of circumstances, notably:

The expected total size of the solution
The number of source systems and how the information in them overlap
The average expected life-time of source systems: source system volatility
Current user requirements
Expected number of changes in user requirements: requirements volatility

1. One-Shot Data Warehouse (small- and mid-sized solution)

Current computing possibilities will typically enable the complete build (from scratch) of a data warehouse solution for small- and mid-sized solutions within the typically nightly maintenance window. This can be a very efficient solution with a high return on investment. Some people call it the destroy and rebuild approach since you are removing all previous data from the data warehouse before rebuilding it.

An obvious disadvantage of this approach is that there is no track of changes in the source system: If a customer moves from London to Paris all the historical sales will be reported as Paris sales. Another disadvantage is the sheer amount of data that is transferred every night. That can accumulate to an amount that can not been loaded in the nightly maintenance window.

2. Keeping History in Dimensions

The problem of keeping track of history has been a major issue in data warehousing. In the theories by Ralph Kimball Slowly Changing Dimensions play an import role. In his books he mentioned some patterns to handle history in dimensions.

Type 1: Change history

For some attributes the business isn’t interested in keeping historical information. This of course is most appropriate when correcting previous errors. Or in other situations where there is no business value in keeping track of historical information. For example when a person’s name changes due to marriage. The rows that were imported earlier in the dimension table are corrected by an UPDATE statement.

Type 2: Keep history

In this pattern history is preserved because a new record is entered for every change in a dimension attribute. The old row is marked as inactive and an end date is added to the row. Any new fact tables records that are inserted after this dimension row is changed will link to the newly added row.

Other types

Type 1 and Type 2 are the most commonly used. However there are some alternatives. They are described in this Wikipedia article.

Other advantages

Besides the tracking of history another major advantage is that you’ll only need changed and new rows (the delta rows) from your source system. And thus diminishing the time you need to load the data in your data warehouse. However getting only these delta rows can be challenging.

3. Historical / Persistent Staging Area

In this approach you’ll typically get a copy of all relevant source tables and add temporal information in a ‘StartDate’ and a ‘EndDate’ column. Again you’ll only process the delta rows: loading new and changed rows. And whenever a row has changed you’ll end date the old row.

Based upon this historical staging area you can adopt method 1 or 2 to load the data in your dimensional model.

What are the reasons for having a persistent staging area?

In an earlier blog post I gave this answer:

Well …first of all it could be a demand from auditors or data governance initiatives. Possible driven by external regulations.(Sarbanes-Oxley, Basel I, Basel II, HIPAA, ..)

A second reason which can be tied to the first has to do with keeping control of / reporting on data quality issues in the source systems and thus:

Identify possible improvements in processes. (e.g. same data entered twice)
Increase / decrease confidence in the data, information and decisions

The third reason has to do with agility. The ability to respond to changes rapidly. Small changes like changing a dimension attribute from type 1 to 2 or adding additional attributes to a dimension. But also large changes like big organizational changes, mergers, new ERP implementations. By having a persistent staging area it’ll be far easier to respond to these changes and rebuild the data marts (with history). Resulting in quicker, better reporting after such a change.

4. Data Vault

The Data Vault is a special implementation of a persistent staging area. Source tables are split based on column aspects separating the structural items (business keys and the associations between the business keys) from the descriptive attributes.

And thus the same advantages apply to the Data Vault as well. The power of the Data Vault comes from the integration on business keys. This approach is therefore especially suitable if you have source systems with overlapping information and/or if source systems change frequently over time.

This wiki article by Ronald Kunenberg gives a nice overview of the Data Vault Methodology with a lot of additional references.

Conclusion

In this article I discussed some aspects that you should consider when you choose an approach for (re)designing your Data Warehouse. This table summarizes these aspects:

Aspect / Approach:	1	2	3	4
Simple, fast solution	x
Only process delta rows		x	x	x
Keep track of historical information		x	x	x
Solve auditing / data governance issues			x	x
Control over data quality in source systems			x	x
Easily adapt to changing user requirements			x	x
Source systems with overlapping information				x
Frequently changing source systems				x

Four Approaches to Data Warehousing

Data Warehouse Solutions

1. One-Shot Data Warehouse (small- and mid-sized solution)

2. Keeping History in Dimensions

Type 1: Change history

Type 2: Keep history

Other types

Other advantages

3. Historical / Persistent Staging Area

What are the reasons for having a persistent staging area?

4. Data Vault

Conclusion

Trending Articles

Long Distance Relationship Tagalog Love Quotes

Tinkerbell para colorear, pintar e imprimir

Pokemon para colorear

Arbol genealogico para colorear

Dibujos de animales para imprimir

Re:Mutton Pies (lleechef)

Mga Tala sa “Unang Siglo ng Nobela sa Filipinas” (2009) ni Virgilio S. Almario

Re: lwIP PIC32 port - new title : CycloneTCP a new open source stack for...

Kurup heroin 9.09 grams na 4mer, Pynthor, shah kem uwei

PREMATURE CAMPAIGNING – Meron ba nun?

HOY PANGIT, MAGBAYAD KA!

Henry el monstruo feliz para colorear e imprimir

Dama y vagabundo para pintar

Girasoles para colorear

Good Morning Quotes

RE: Mutton Pies (jameskoiman)

Hagibis (1946) by Francisco V. Coching

Ka riam ka beit bad ka por riam

Vimeo Create - Video Maker & Editor 1.6.0 by Vimeo Inc

Vimeo 3.42.1 by Vimeo Inc