Building a data warehouse: a step-by-step guide

Tatyana Korobeyko

With the amount of data worldwide forecast to reach 180 zettabytes by 2025, businesses have to deal with two major issues – where to store their data and how to make use of it. Around since the 1980s and continually extended in functionality, data warehouses can help address both challenges. However, despite the technology’s maturity and the fact that data warehouses are usually developed by experts, the share of failed projects is disturbingly high, according to research by the independent market research firm Vanson Bourne.

In this article, we dive into the details of data warehouse implementation by outlining the two fundamental approaches to data warehouse design and the key development steps. We also give advice on a suitable team composition for data warehousing consulting services and recommend technologies for creating a scalable solution.

What is a data warehouse and why build one?

A data warehouse is a system that consolidates and stores enterprise information from diverse sources in a form suitable for analytical querying and reporting, supporting business intelligence and data analytics initiatives. The successful implementation of such a repository promises multiple benefits, including:

3 core components of a data warehouse architecture

When you create the architecture of your future data warehouse, you have to take into account multiple factors, such as how many data sources will connect to the data warehouse, the volume, nature, and complexity of the information in each of them, your analytics objectives, your existing technology environment, and so on. However, it would be wrong to say that every architecture is unique, since practically all of them share the following three components:

  1. Source systems – operational databases capturing transactions, IoT devices streaming sensor data, SaaS applications, external data sources, etc.
  2. Data staging area – a zone that temporarily hosts copied data, plus a set of processes that clean and transform it according to business-defined rules before loading it into the data warehouse (a minimal sketch of this flow follows the list). With a staging area, you have a historical record of the original data to fall back on if an ETL job fails. Usually, as soon as the ETL job completes successfully, the information in the staging area is erased; however, you may keep it for a certain period for legacy reasons or archive it. This area can be omitted if all data transformations occur in the data warehouse database itself.
  3. Data storage – a data warehouse database for company-wide information and data marts (DWH subsets), created for specific departments or lines of business.
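
To make the staging flow above more tangible, here is a minimal sketch of the source-to-staging-to-warehouse movement. It is an illustration only: SQLite stands in for the source, staging, and warehouse databases, and all table and column names are hypothetical.

```python
import sqlite3

# SQLite stands in for the source system, staging area, and warehouse database.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Source system: an operational table with raw, possibly messy records (hypothetical).
cur.execute("CREATE TABLE src_orders (id INTEGER, customer TEXT, amount TEXT, order_date TEXT)")
cur.executemany(
    "INSERT INTO src_orders VALUES (?, ?, ?, ?)",
    [(1, "  Acme Corp ", "100.50", "2024-01-15"),
     (2, "Globex", None, "2024-01-16")],
)

# Staging area: an exact copy of the extracted data, kept until the load succeeds.
cur.execute("CREATE TABLE stg_orders AS SELECT * FROM src_orders")

# Warehouse table: cleaned and typed according to business-defined rules.
cur.execute("""
    CREATE TABLE dwh_orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount     REAL,
        order_date TEXT
    )
""")
cur.execute("""
    INSERT INTO dwh_orders
    SELECT id, TRIM(customer), CAST(COALESCE(amount, '0') AS REAL), order_date
    FROM stg_orders
""")
con.commit()

# Once the load is verified, the staging copy can be erased or archived.
cur.execute("DELETE FROM stg_orders")
con.commit()
```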

Besides these elements, an enterprise data warehousing solution also encompasses a data governance and metadata management component. The extended data warehouse environment may also include OLAP cubes (multidimensional data structures that store aggregated data to enable interactive queries) and a data access layer (tools and applications for end users to access and manipulate the stored information). However, these elements are a part of a bigger ecosystem – a BI architecture, so we won’t explore them here.

Approaches to building a data warehouse

The two fundamental design methods used to build a data warehouse are Inmon’s (top-down) and Kimball’s (bottom-up) approaches.

Inmon’s approach

Under Inmon’s approach, a centralized repository for enterprise information is designed first, following a normalized data model: atomic data is stored in tables grouped by subject area and linked through joins. After the enterprise data warehouse is built, the data stored in it is used to structure data marts.
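
As an illustration only, a small fragment of such a normalized model might look like the sketch below; the subject areas, tables, and columns are hypothetical, and SQLite simply stands in for an enterprise warehouse database.

```python
import sqlite3

# Hypothetical fragment of a normalized (3NF-style) model: atomic data is split
# into narrow tables per subject area (customers, products, orders) linked by keys.
ddl = """
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);
CREATE TABLE order_line (
    order_id   INTEGER NOT NULL REFERENCES sales_order(order_id),
    line_no    INTEGER NOT NULL,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    amount     REAL NOT NULL,
    PRIMARY KEY (order_id, line_no)
);
"""
con = sqlite3.connect(":memory:")
con.executescript(ddl)

# Analytical queries reassemble the subject areas through joins, e.g. revenue by category.
revenue_by_category = con.execute("""
    SELECT p.category, SUM(l.amount) AS revenue
    FROM order_line l
    JOIN product p ON p.product_id = l.product_id
    GROUP BY p.category
""").fetchall()  # empty here; shown only for the join pattern
```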

Inmon’s approach is preferable when you need to:

However, one of the major constraints of this method is that its setup and implementation are more time- and resource-consuming than Kimball’s approach.

Kimball’s approach

Kimball’s approach suggests that dimensional data marts should be created first; then, if required, a company may proceed with creating a logical enterprise data warehouse.

Advocates of this approach point out that since dimensional data marts require minimal normalization, such data warehouse projects take less time and fewer resources. On the other hand, you may end up with duplicate data across tables and have to repeat ETL activities, as each data mart is created independently.
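
For contrast, a star schema for a hypothetical sales data mart could be sketched as follows: one fact table of measures surrounded by denormalized dimension tables. Again, all names are illustrative and SQLite only stands in for a real platform.

```python
import sqlite3

# Hypothetical star schema for a sales data mart: dimensions are denormalized,
# and the fact table references them via surrogate keys.
ddl = """
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,  -- e.g. 20240115
    full_date TEXT, year INTEGER, quarter INTEGER, month INTEGER
);
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT, segment TEXT, region TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT, category TEXT, brand TEXT
);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    revenue      REAL
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```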

Though the two approaches may seem rather different, they complement each other well, which is proven by the emergence of alternative approaches that combine the principles of both design methods.

A step-by-step guide for building a data warehouse

It is common practice to start a data warehouse initiative with a comprehensive readiness assessment. When evaluating readiness for a data warehouse project, consider such factors as:

Once you’ve assessed the project’s readiness and are satisfied with the results, you need to develop a framework for project planning and management and then move on to data warehouse development, which starts with defining your business requirements.

1. Business requirements definition

Business requirements affect almost every decision throughout the data warehouse development process – from what information should be available to how often it should be accessed. Therefore, it’s advisable to start by interviewing your business users to define:

While interviewing business users, you should also establish effective communication with your key IT specialists (database administrators, operational source system experts, etc.) to determine whether the currently available information is sufficient to meet such business requirements as:

2. Data warehouse conceptualization and technology selection

The findings from the previous step are used as a foundation for defining the scope of the solution-to-be, so the needs and expectations of your business and IT users should be carefully analyzed and prioritized to draw up the optimal data warehouse feature set.

After that, you have to identify the architectural approach to building the data warehousing solution and evaluate and select the optimal technology for each of the architectural components – staging area, storage area, etc. While drawing up the tech stack, consider such factors as:

By this time, you should also choose a deployment option – on-premises, cloud, or hybrid. The choice is dictated by numerous factors, such as data volume and nature, costs, security requirements, the number of users and their locations, and system availability, among others.

3. Data warehouse environment design

Before and during the design of your data warehouse, you need to define your data sources and analyze the information stored in them – what data types and structures are available, the volume of information generated daily, monthly, etc., as well as its quality, sensitivity, and refresh frequency.
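
A simple profiling pass can record these facts per source. The sketch below is an illustration only, assuming a single hypothetical src_orders table in SQLite; in practice you would run something similar against every real source system.

```python
import sqlite3

# Profile one hypothetical source table: volume, completeness, and recency.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (id INTEGER, customer TEXT, amount REAL, updated_at TEXT);
    INSERT INTO src_orders VALUES
        (1, 'Acme', 100.5, '2024-01-15'),
        (2, NULL,   75.0,  '2024-01-16');
""")

columns = [row[1] for row in con.execute("PRAGMA table_info(src_orders)")]
row_count = con.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
print(f"rows: {row_count}")

# Null rate and cardinality per column hint at data quality and structure.
for col in columns:
    nulls, distinct = con.execute(
        f"SELECT SUM({col} IS NULL), COUNT(DISTINCT {col}) FROM src_orders"
    ).fetchone()
    print(f"{col}: {nulls / row_count:.0%} null, {distinct} distinct values")

# The most recent change hints at the required refresh frequency.
print("last update:", con.execute("SELECT MAX(updated_at) FROM src_orders").fetchone()[0])
```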

The next step is logical data modeling, or arranging the company’s data into a series of logical relationships called entities (real-world objects) and attributes (characteristics that define these objects). Entity-relationship modeling underlies various modeling techniques, including the normalized schema (a design approach for relational databases) and the star schema (used for dimensional modeling).

Next, these logical data models are converted into physical database structures: entities become tables, attributes become columns, relationships become foreign key constraints, and so on.
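
As a toy illustration of that conversion, the snippet below turns two hypothetical entity definitions into CREATE TABLE statements, mapping attributes to columns and relationships to foreign key constraints; real data modeling tools automate this far more thoroughly.

```python
# Hypothetical logical model: entities with attributes and relationships.
entities = {
    "customer": {
        "attributes": {"customer_id": "INTEGER PRIMARY KEY", "customer_name": "TEXT"},
        "references": {},
    },
    "sales_order": {
        "attributes": {
            "order_id": "INTEGER PRIMARY KEY",
            "order_date": "TEXT",
            "customer_id": "INTEGER",
        },
        "references": {"customer_id": "customer(customer_id)"},
    },
}

def to_ddl(name: str, spec: dict) -> str:
    """Convert one logical entity into a physical CREATE TABLE statement."""
    cols = [f"    {col} {ctype}" for col, ctype in spec["attributes"].items()]
    fks = [f"    FOREIGN KEY ({col}) REFERENCES {target}"
           for col, target in spec["references"].items()]
    return f"CREATE TABLE {name} (\n" + ",\n".join(cols + fks) + "\n);"

for name, spec in entities.items():
    print(to_ddl(name, spec))
```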

After data modeling is finished, the next step is to design the data staging area, both to supply the data warehouse with high-quality aggregated data and to define and control the source-to-target data flow during all subsequent data loads.

The design step also encompasses the creation of data access and usage policies, the establishment of the metadata catalog, business glossaries, etc.

4. Data warehouse development and launch

This step starts with customizing and configuring the selected technologies (DWH platform, data transformation technologies, data security software, etc.). The company then develops ETL pipelines and implements data security measures.
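
One common pattern built at this stage is an incremental (watermark-based) load that only pulls records changed since the previous run. The sketch below is a minimal, hypothetical example using SQLite; production pipelines would add logging, error handling, and orchestration.

```python
import sqlite3

# SQLite stands in for the source and warehouse; all names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE dwh_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (pipeline TEXT PRIMARY KEY, last_loaded TEXT);
    INSERT INTO etl_watermark VALUES ('orders', '1970-01-01');
    INSERT INTO src_orders VALUES (1, 100.5, '2024-01-15'), (2, 75.0, '2024-01-16');
""")

def run_orders_pipeline(con: sqlite3.Connection) -> None:
    # Read the watermark left by the previous run.
    (last_loaded,) = con.execute(
        "SELECT last_loaded FROM etl_watermark WHERE pipeline = 'orders'"
    ).fetchone()
    # Extract only records changed since then and upsert them into the warehouse.
    rows = con.execute(
        "SELECT id, amount, updated_at FROM src_orders WHERE updated_at > ?",
        (last_loaded,),
    ).fetchall()
    con.executemany("INSERT OR REPLACE INTO dwh_orders VALUES (?, ?, ?)", rows)
    # Advance the watermark so the next run skips already-loaded records.
    if rows:
        con.execute(
            "UPDATE etl_watermark SET last_loaded = ? WHERE pipeline = 'orders'",
            (max(r[2] for r in rows),),
        )
    con.commit()

run_orders_pipeline(con)
```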

After all major components are in place, they have to be integrated with the existing data infrastructure (data sources, BI and analytics software, a data lake, etc.) as well as with each other, so that data can then be migrated.

Before the final roll-out, you have to make sure your end users can handle the new technology environment – that is, that all of them understand what information is available, what it means, how to access it, and what tools to use. Customized training for both standard and power users, along with support documentation, will help with that. Besides that, you need to:

5. After-launch support and maintenance

After the initial deployment, you need to focus on your business users and provide ongoing support and education. Over time, data warehouse performance metrics and user satisfaction scores should be measured regularly, as this will help you ensure the long-term health and growth of your data warehouse.

Key roles for a data warehouse project

  1. Project manager
  2. Business analyst
  3. Data modeler
  4. Data warehouse database administrator (DBA)
  5. ETL developer
  6. Quality assurance engineer

Besides these key roles, other professionals may participate in the project as well, such as a solution architect, a tech support specialist, a DevOps engineer, a data steward, a data warehouse trainer, etc. It is worth noting that sometimes individual staff members may perform several roles.

3 leading data warehouse technologies to consider

Using inappropriate technology is one of the reasons why data warehouse projects fail. Besides correctly identifying your use case, you also need to choose the optimal software from the numerous, seemingly similar options available on the market. Here, we review data warehouse services and platforms that have high customer satisfaction scores, are rated highly in various market research reports, and embrace the principles of data warehouse modernization. The described functionality is not exhaustive, though: while drawing up the descriptions, we concentrated mainly on data integration capabilities, built-in connectivity with analytics and business intelligence services, reliability, and data security.

  1. Amazon Redshift
  2. Google BigQuery
  3. Azure Synapse Analytics