With the amount of data worldwide forecasted to grow up to 180 zettabytes by 2025, businesses have to deal with two major issues – where to store their data and how to make use of it. In place since the 1980s and constantly having their functionality extended, data warehouses can help deal with both these challenges. However, regardless of the technology’s maturity and the fact that data warehouses are usually developed by experts, the percentage of failed projects is disturbing, according to the research from the independent market research firm Vanson Bourne.
In this article, we will dive into the details of data warehouse implementation by outlining the two fundamental approaches to data warehouse design and data warehouse development steps. We also give advice on a suitable team composition for data warehousing consulting services and recommend technologies for creating a scalable solution.
A data warehouse is a system, which consolidates and stores enterprise information from diverse sources in a form suitable for analytical querying and reporting to support business intelligence and data analytics initiatives. The successful implementation of such a repository promises multiple benefits, including:
When you create the architecture of your future data warehouse, you have to take into account multiple factors, such as how many data sources will connect to the data warehouse, the amount of information in each of them together with its nature and complexity, your analytics objectives, existing technology environment, and so on. However, stating that each architecture is unique in its kind would be wrong, since practically each of them has the following three components:
Besides these elements, an enterprise data warehousing solution also encompasses a data governance and metadata management component. The extended data warehouse environment may also include OLAP cubes (multidimensional data structures that store aggregated data to enable interactive queries) and a data access layer (tools and applications for end users to access and manipulate the stored information). However, these elements are a part of a bigger ecosystem – a BI architecture, so we won’t explore them here.
The two fundamental design methods, which are used to build a data warehouse, are Inmon’s (Top-down) and Kimball’s (Bottom-up) approaches.
Within Inmon’s approach, firstly, a centralized repository for enterprise information is designed according to a normalized data model, where atomic data is stored in tables that are grouped together by subject areas with the help of joins. After the enterprise data warehouse is built, the data stored there is used to structure data marts.
Inmon’s approach is more preferable in cases when you need to:
However, one of the major constraints of this method is that the setup and implementation is more time and resource-consuming compared to Kimball’s approach.
Kimball’s approach suggests that dimensional data marts should be created first, then if required, a company may proceed with creating a logical enterprise data warehouse.
The advocates of this approach point out that since dimensional data marts require minimal normalization, such data warehouse projects take less time and resources. On the other hand, you may find duplicate data in tables and have to repeat ETL activities, as each data mart is created independently.
Though the two approaches may seem rather different, they complement each other well, which is proven by the emergence of alternative approaches that combine the principles of both design methods.
It is common practice to start a data warehouse initiative with a comprehensive readiness assessment. When evaluating the readiness for a data warehouse project, consider such factors as:
After you’ve assessed the readiness for the project and are hopefully satisfied with it, you need to develop a framework for project planning and management, and then, eventually, move on to data warehouse development, which starts with the definition of your business requirements.
Business requirements affect almost every decision throughout the data warehouse development process – from what information should be available to how often it should be accessed. Therefore, it’s viable to start with interviewing your business users to define:
While interviewing business users, you should also set effective communication with your key IT specialists (database administrators, operational source system experts, etc.) to identify if the currently available information is sufficient in meeting such business requirements as:
The findings from the previous step are used as a foundation for defining the scope of the solution-to-be, so the needs and expectations of your business and IT users should be carefully analyzed and prioritized to draw up the optimal data warehouse feature set.
After that, you have to identify the architectural approach to building a data warehousing solution, evaluate and select the optimal technology for each of the architectural components – staging area, storage area, etc. While drawing up the tech stack, consider such factors as:
By this time, you also should define the deployment option – on-premises, cloud or hybrid. The deployment option choice is dictated by numerous factors, such as data volume, data nature, costs, security requirements, number of users and their location as well as system availability among others.
Before and during designing your data warehouse, you need to define your data sources and analyze information stored in there – what data types and structures are available, volume of information generated daily, monthly, etc., in addition to its quality, sensitivity, refresh frequency.
The next step would be logical data modeling, or arranging company’s data into a series of logical relationships called entities (real-world objects) and attributes (characteristics that define these objects). Entity-relationship modeling is used in various modeling techniques, including a normalized schema (a design approach for relational databases) and a star schema (used for dimensional modeling).
Next, these logical data models are converted into database structures, for example, entities are converted into tables, attributes are translated into columns, relationships are converted into foreign key constraints, and so on.
After data modeling is finished, the first step is to design the data staging area to provide the data warehouse with high-quality aggregated data in the first place and also to define and control the source-to-target data flow during all subsequent data loads.
The design step also encompasses the creation of data access and usage policies, the establishment of the metadata catalog, business glossaries, etc.
The step starts with customizing and configuring the selected technologies (DWH platform, data transformation technologies, data security software, etc.). The company then develops ETL pipelines and introduces data security.
After all major components are introduced, they have to be integrated with the existing data infrastructure (data sources, BI and analytics software, a data lake, etc.) as well as each other so the data can be migrated afterwards.
Before the final roll-up, you have to ensure that your end users can handle the new technology environment, meaning that all of them understand what information is available, what it means, how to access it and what tools to use. Customized training for both standard and power users as well as support documentation will help with that. Besides that, you need to:
After the initial deployment, you need to focus on your business users and provide ongoing support and education. Over time, data warehouse performance metrics and user satisfaction scores will have to be measured, as it’ll help you ensure the long-term health and growth of your data warehouse.
Project manager
Business analyst
Data modeler
Data warehouse database administrator (DBA)
ETL developer
Quality assurance engineer
Besides these key roles, other professionals may participate in the project as well, such as a solution architect, a tech support specialist, a DevOps engineer, a data steward, a data warehouse trainer, etc. It is worth noting that sometimes individual staff members may perform several roles.
Using inappropriate technology is one of the reasons why data warehouse projects fail. Besides the fact that you need to correctly identify your use case, you also need to choose the optimal software from numerous seemingly similar options available on the market. Here, we review data warehouse services and platforms that have great customer satisfaction scores, are rated highly in various market research reports, and embrace the principles of data warehouse modernization. The described functionality is not exhaustive though: while drawing up their descriptions, we mainly concentrated on their data integration capabilities, built-in connectivity with analytics and business intelligence services, reliability, and data security.
Amazon Redshift
Google BigQuery
Azure Synapse Analytics