Building a data warehouse: a step-by-step guide

Tatyana Korobeyko

With the amount of data worldwide forecast to reach 180 zettabytes by 2025, businesses have to deal with two major issues – where to store their data and how to make use of it. Around since the 1980s and continually extended in functionality, data warehouses can help address both challenges. However, despite the technology’s maturity and the fact that data warehouses are usually developed by experts, the share of failed projects is disturbingly high, according to research by the independent market research firm Vanson Bourne.

In this article, we dive into the details of data warehouse implementation by outlining the two fundamental approaches to data warehouse design and the key development steps. We also give advice on a suitable team composition for data warehousing consulting services and recommend technologies for creating a scalable solution.

What is a data warehouse and why build one?

A data warehouse is a system that consolidates and stores enterprise information from diverse sources in a form suitable for analytical querying and reporting, supporting business intelligence and data analytics initiatives. The successful implementation of such a repository promises multiple benefits, including:

3 core components of a data warehouse architecture

When you create the architecture of your future data warehouse, you have to take into account multiple factors, such as how many data sources will connect to the data warehouse, the volume, nature, and complexity of the information in each of them, your analytics objectives, your existing technology environment, and so on. However, it would be wrong to say that every architecture is unique, since practically all of them share the following three components:

  1. Source systems – operational databases capturing transactions, IoT devices streaming sensor data, SaaS applications, external data sources, etc.
  2. Data staging area – a zone that temporarily hosts copied data, plus a set of processes that clean and transform it according to business-defined rules before loading it into the data warehouse (a minimal sketch of this flow follows the list). With a staging area, you have a historical record of the original data to fall back on if an ETL job fails. Usually, as soon as the ETL job completes successfully, the information in the staging area is erased; however, you may keep it for a certain period for legacy reasons or archive it. This area can be omitted if all data transformations occur in the data warehouse database itself.
  3. Data storage – a data warehouse database for company-wide information and data marts (DWH subsets), created for specific departments or lines of business.
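
To make the staging flow above more tangible, here is a minimal sketch of the source-to-staging-to-warehouse movement. It is an illustration only: SQLite stands in for the source, staging, and warehouse databases, and all table and column names are hypothetical.

```python
import sqlite3

# SQLite stands in for the source system, staging area, and warehouse database.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Source system: an operational table with raw, possibly messy records (hypothetical).
cur.execute("CREATE TABLE src_orders (id INTEGER, customer TEXT, amount TEXT, order_date TEXT)")
cur.executemany(
    "INSERT INTO src_orders VALUES (?, ?, ?, ?)",
    [(1, "  Acme Corp ", "100.50", "2024-01-15"),
     (2, "Globex", None, "2024-01-16")],
)

# Staging area: an exact copy of the extracted data, kept until the load succeeds.
cur.execute("CREATE TABLE stg_orders AS SELECT * FROM src_orders")

# Warehouse table: cleaned and typed according to business-defined rules.
cur.execute("""
    CREATE TABLE dwh_orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount     REAL,
        order_date TEXT
    )
""")
cur.execute("""
    INSERT INTO dwh_orders
    SELECT id, TRIM(customer), CAST(COALESCE(amount, '0') AS REAL), order_date
    FROM stg_orders
""")
con.commit()

# Once the load is verified, the staging copy can be erased or archived.
cur.execute("DELETE FROM stg_orders")
con.commit()
```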

Besides these elements, an enterprise data warehousing solution also encompasses a data governance and metadata management component. The extended data warehouse environment may also include OLAP cubes (multidimensional data structures that store aggregated data to enable interactive queries) and a data access layer (tools and applications for end users to access and manipulate the stored information). However, these elements are a part of a bigger ecosystem – a BI architecture, so we won’t explore them here.

Approaches to building a data warehouse

The two fundamental design methods used to build a data warehouse are Inmon’s (top-down) and Kimball’s (bottom-up) approaches.

Inmon’s approach

Under Inmon’s approach, a centralized repository for enterprise information is designed first, following a normalized data model: atomic data is stored in tables grouped by subject area and linked through joins. After the enterprise data warehouse is built, the data stored in it is used to structure data marts.
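
As an illustration only, a small fragment of such a normalized model might look like the sketch below; the subject areas, tables, and columns are hypothetical, and SQLite simply stands in for an enterprise warehouse database.

```python
import sqlite3

# Hypothetical fragment of a normalized (3NF-style) model: atomic data is split
# into narrow tables per subject area (customers, products, orders) linked by keys.
ddl = """
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);
CREATE TABLE order_line (
    order_id   INTEGER NOT NULL REFERENCES sales_order(order_id),
    line_no    INTEGER NOT NULL,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    amount     REAL NOT NULL,
    PRIMARY KEY (order_id, line_no)
);
"""
con = sqlite3.connect(":memory:")
con.executescript(ddl)

# Analytical queries reassemble the subject areas through joins, e.g. revenue by category.
revenue_by_category = con.execute("""
    SELECT p.category, SUM(l.amount) AS revenue
    FROM order_line l
    JOIN product p ON p.product_id = l.product_id
    GROUP BY p.category
""").fetchall()  # empty here; shown only for the join pattern
```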

Inmon’s approach is preferable when you need to:

However, one of the major constraints of this method is that its setup and implementation are more time- and resource-consuming than Kimball’s approach.

Kimball’s approach

Kimball’s approach suggests that dimensional data marts should be created first; then, if required, a company may proceed with creating a logical enterprise data warehouse.

Advocates of this approach point out that since dimensional data marts require minimal normalization, such data warehouse projects take less time and fewer resources. On the other hand, you may end up with duplicate data across tables and have to repeat ETL activities, as each data mart is created independently.
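
For contrast, a star schema for a hypothetical sales data mart could be sketched as follows: one fact table of measures surrounded by denormalized dimension tables. Again, all names are illustrative and SQLite only stands in for a real platform.

```python
import sqlite3

# Hypothetical star schema for a sales data mart: dimensions are denormalized,
# and the fact table references them via surrogate keys.
ddl = """
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,  -- e.g. 20240115
    full_date TEXT, year INTEGER, quarter INTEGER, month INTEGER
);
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT, segment TEXT, region TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT, category TEXT, brand TEXT
);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    revenue      REAL
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```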

Though the two approaches may seem rather different, they complement each other well, which is proven by the emergence of alternative approaches that combine the principles of both design methods.

A step-by-step guide for building a data warehouse

It is common practice to start a data warehouse initiative with a comprehensive readiness assessment. When evaluating readiness for a data warehouse project, consider such factors as:

Once you’ve assessed the project’s readiness and are satisfied with the results, you need to develop a framework for project planning and management and then move on to data warehouse development, which starts with defining your business requirements.

1. Business requirements definition

Business requirements affect almost every decision throughout the data warehouse development process – from what information should be available to how often it should be accessed. Therefore, it’s advisable to start by interviewing your business users to define:

While interviewing business users, you should also establish effective communication with your key IT specialists (database administrators, operational source system experts, etc.) to determine whether the currently available information is sufficient to meet such business requirements as:

2. Data warehouse conceptualization and technology selection

The findings from the previous step are used as a foundation for defining the scope of the solution-to-be, so the needs and expectations of your business and IT users should be carefully analyzed and prioritized to draw up the optimal data warehouse feature set.

After that, you have to identify the architectural approach to building the data warehousing solution and evaluate and select the optimal technology for each of the architectural components – staging area, storage area, etc. While drawing up the tech stack, consider such factors as:

By this time, you should also choose a deployment option – on-premises, cloud, or hybrid. The choice is dictated by numerous factors, such as data volume and nature, costs, security requirements, the number of users and their locations, and system availability, among others.

3. Data warehouse environment design

Before and during the design of your data warehouse, you need to define your data sources and analyze the information stored in them – what data types and structures are available, the volume of information generated daily, monthly, etc., as well as its quality, sensitivity, and refresh frequency.
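
A simple profiling pass can record these facts per source. The sketch below is an illustration only, assuming a single hypothetical src_orders table in SQLite; in practice you would run something similar against every real source system.

```python
import sqlite3

# Profile one hypothetical source table: volume, completeness, and recency.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (id INTEGER, customer TEXT, amount REAL, updated_at TEXT);
    INSERT INTO src_orders VALUES
        (1, 'Acme', 100.5, '2024-01-15'),
        (2, NULL,   75.0,  '2024-01-16');
""")

columns = [row[1] for row in con.execute("PRAGMA table_info(src_orders)")]
row_count = con.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
print(f"rows: {row_count}")

# Null rate and cardinality per column hint at data quality and structure.
for col in columns:
    nulls, distinct = con.execute(
        f"SELECT SUM({col} IS NULL), COUNT(DISTINCT {col}) FROM src_orders"
    ).fetchone()
    print(f"{col}: {nulls / row_count:.0%} null, {distinct} distinct values")

# The most recent change hints at the required refresh frequency.
print("last update:", con.execute("SELECT MAX(updated_at) FROM src_orders").fetchone()[0])
```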

The next step is logical data modeling, or arranging the company’s data into a series of logical relationships called entities (real-world objects) and attributes (characteristics that define these objects). Entity-relationship modeling underlies various modeling techniques, including the normalized schema (a design approach for relational databases) and the star schema (used for dimensional modeling).

Next, these logical data models are converted into physical database structures: entities become tables, attributes become columns, relationships become foreign key constraints, and so on.
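
As a toy illustration of that conversion, the snippet below turns two hypothetical entity definitions into CREATE TABLE statements, mapping attributes to columns and relationships to foreign key constraints; real data modeling tools automate this far more thoroughly.

```python
# Hypothetical logical model: entities with attributes and relationships.
entities = {
    "customer": {
        "attributes": {"customer_id": "INTEGER PRIMARY KEY", "customer_name": "TEXT"},
        "references": {},
    },
    "sales_order": {
        "attributes": {
            "order_id": "INTEGER PRIMARY KEY",
            "order_date": "TEXT",
            "customer_id": "INTEGER",
        },
        "references": {"customer_id": "customer(customer_id)"},
    },
}

def to_ddl(name: str, spec: dict) -> str:
    """Convert one logical entity into a physical CREATE TABLE statement."""
    cols = [f"    {col} {ctype}" for col, ctype in spec["attributes"].items()]
    fks = [f"    FOREIGN KEY ({col}) REFERENCES {target}"
           for col, target in spec["references"].items()]
    return f"CREATE TABLE {name} (\n" + ",\n".join(cols + fks) + "\n);"

for name, spec in entities.items():
    print(to_ddl(name, spec))
```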

After data modeling is finished, the next step is to design the data staging area, both to supply the data warehouse with high-quality aggregated data and to define and control the source-to-target data flow during all subsequent data loads.

The design step also encompasses the creation of data access and usage policies, the establishment of the metadata catalog, business glossaries, etc.

4. Data warehouse development and launch

This step starts with customizing and configuring the selected technologies (DWH platform, data transformation technologies, data security software, etc.). The company then develops ETL pipelines and implements data security measures.
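
One common pattern built at this stage is an incremental (watermark-based) load that only pulls records changed since the previous run. The sketch below is a minimal, hypothetical example using SQLite; production pipelines would add logging, error handling, and orchestration.

```python
import sqlite3

# SQLite stands in for the source and warehouse; all names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE dwh_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (pipeline TEXT PRIMARY KEY, last_loaded TEXT);
    INSERT INTO etl_watermark VALUES ('orders', '1970-01-01');
    INSERT INTO src_orders VALUES (1, 100.5, '2024-01-15'), (2, 75.0, '2024-01-16');
""")

def run_orders_pipeline(con: sqlite3.Connection) -> None:
    # Read the watermark left by the previous run.
    (last_loaded,) = con.execute(
        "SELECT last_loaded FROM etl_watermark WHERE pipeline = 'orders'"
    ).fetchone()
    # Extract only records changed since then and upsert them into the warehouse.
    rows = con.execute(
        "SELECT id, amount, updated_at FROM src_orders WHERE updated_at > ?",
        (last_loaded,),
    ).fetchall()
    con.executemany("INSERT OR REPLACE INTO dwh_orders VALUES (?, ?, ?)", rows)
    # Advance the watermark so the next run skips already-loaded records.
    if rows:
        con.execute(
            "UPDATE etl_watermark SET last_loaded = ? WHERE pipeline = 'orders'",
            (max(r[2] for r in rows),),
        )
    con.commit()

run_orders_pipeline(con)
```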

After all major components are in place, they have to be integrated with the existing data infrastructure (data sources, BI and analytics software, a data lake, etc.) as well as with each other, so that data can then be migrated.

Before the final roll-out, you have to make sure your end users can handle the new technology environment – that is, that all of them understand what information is available, what it means, how to access it, and what tools to use. Customized training for both standard and power users, along with support documentation, will help with that. Besides that, you need to:

5. After-launch support and maintenance

After the initial deployment, you need to focus on your business users and provide ongoing support and education. Over time, data warehouse performance metrics and user satisfaction scores should be measured regularly, as this will help you ensure the long-term health and growth of your data warehouse.

Key roles for a data warehouse project

  1. Project manager
  2. Business analyst
  3. Data modeler
  4. Data warehouse database administrator (DBA)
  5. ETL developer
  6. Quality assurance engineer

Besides these key roles, other professionals may participate in the project as well, such as a solution architect, a tech support specialist, a DevOps engineer, a data steward, a data warehouse trainer, etc. It is worth noting that sometimes individual staff members may perform several roles.

3 leading data warehouse technologies to consider

Using inappropriate technology is one of the reasons why data warehouse projects fail. Besides correctly identifying your use case, you also need to choose the optimal software from the numerous, seemingly similar options available on the market. Here, we review data warehouse services and platforms that have high customer satisfaction scores, are rated highly in various market research reports, and embrace the principles of data warehouse modernization. The described functionality is not exhaustive, though: while drawing up the descriptions, we concentrated mainly on data integration capabilities, built-in connectivity with analytics and business intelligence services, reliability, and data security.

  1. Amazon Redshift
  2. Google BigQuery
  3. Azure Synapse Analytics