What is a data warehouse?
First, let's define what a data warehouse is and why you might want to use one for your organization.
A data warehouse is a centralized repository that stores structured data (database tables, Excel sheets) and semi-structured data (XML files, webpages) for the purposes of reporting and analysis. The data flows in from a variety of sources, such as point-of-sale systems, business applications, and relational databases, and it is usually cleaned and standardized before it hits the warehouse. Because a data warehouse can store large amounts of information, it provides users with easy access to a wealth of historical data, which can be used for data mining, data visualization, and other forms of business intelligence reporting.
Benefits of data warehouse
Reliable data, especially when aggregated over time, helps users make smarter, more informed decisions about the way they run their organization—and data warehouses are what makes that possible. The benefits of enterprise data warehousing are myriad, but some of the most impactful advantages include:
Consolidating data from multiple sources into one single source of truth
Storing and analyzing long-term historical data spanning months and years
Cleansing and transforming data so that it is accurate, consistent, and standardized in structure and form
Reducing query times when gathering data and processing analytics, which improves overall performance across systems
Efficiently loading data without having to deal with the costs of deployment or infrastructure
Securing data so that it is private, protected, and safe
Preparing data for analysis through data mining, visualization tools, and other forms of advanced analytics
Data warehouse vs. data lake
It's clear that data warehouses are essential to any organization's analytics operations. But what's the difference between a data warehouse and other types of data repositories, such as a data lake? And when should one be used over the other?
As repositories, data warehouses and data lakes both store and process data. Yet though they may seem to offer the same functionality, they each have their own particular use cases. This is why organizations commonly incorporate both systems to form a complete, end-to-end solution that can handle a wide range of purposes.
A data warehouse is relational in nature. This means that the structure or schema of the data is determined by predefined business and product requirements that are curated, conformed, and optimized for SQL query operations. As a result, data warehouses are best used for storing data that has been treated with a specific purpose in mind, such as data mining for BI analysis, or for sourcing a business use case that has already been identified.
Like data warehouses, data lakes hold structured and semi-structured data. Yet they are also capable of accommodating raw and unprocessed data from a variety of non-relational sources, including mobile apps, IoT devices, social media, or streaming. This is because structure or schema in a data lake isn't defined until the data is read. As a result of their flexible, scalable nature, data lakes are often used for performing intelligent forms of data analysis, such as machine learning.
|Data lake||Data warehouse|
Structured, semi-structured, unstructured
Schema on read
Schema on write
Big data, IoT, social media, streaming data
Application, business, transactional data, batch reporting
Easy to scale at a low cost
Difficult and expensive to scale
Data scientists, data engineers
Data warehouse professionals, business analysts
Machine learning, predictive analytics, real-time analytics
Core reporting, BI
Data warehouse architecture and design
Now that you know why and when you should use a data warehouse, let's dive into how one works by looking at data warehouse design. A data warehouse is more than just a single silo operating on its own. Rather, it is a highly structured, carefully architected system composed of multiple tiers that interact with your data—and each other—in different ways. Typically, these tiers include:
The bottom tier
Data is ingested from multiple sources, then cleansed and transformed for other applications to use in a process called extract, transform, and load (ETL). The bottom tier is also where data is stored and optimized, which leads to faster query times and better performance overall.
This is where you'll find the analytics engine, also known as the online analytical processing (OLAP) server. OLAP servers access large volumes of data from the data warehouse at a high speed, which leads to lightning-fast results.
The top tier is where the front-end interface visually presents the processed data, which analysts may access and use for all their reporting and self-service BI needs.
How to build a data warehouse
When designing and building a data warehouse, it's important to consider the goals of your organization, both long-term and ad-hoc, as well as the nature of your data. How many data sources are you integrating? Do you plan on automating your workflows? How will you explore and analyze your data? Your build-out will vary depending on the complexity of your needs, but a typical enterprise database warehouse may consist of the following components:
- Data sources that extract operational data from point-of-sale systems, business applications, and other relational databases
- A staging area where data is cleaned and transformed for the warehouse or centralized repository
- A warehouse or centralized repository which stores processed operational data, metadata, summary data, and raw data for easy user access
- The addition of data marts, which takes data from the centralized repository and serves it in subsets to selected groups of users
- A sandbox, which data scientists may use to test new forms of data exploration in a protected environment
- A wide variety of data warehousing tools, frameworks, and APIs for integration, storage, performance, and analysis
Data warehouse tools, software, and resources
In today's data-centric world, plenty of major software companies boast a seemingly endless range of data warehouse software, each with its own specific use case. It may seem daunting, but in order to build a cohesive, high-performance solution, you'll want to invest in the right tools and technologies. Every organization's needs are different, but here are some essential data warehouse products to look into:
Cloud and hybrid cloud data warehousing
A unified, cloud-based data warehousing solution, such as Azure Synapse Analytics, gives organizations the ability to scale, compute, and store at a faster speed and lower cost.
Data integration tools
ETL pipelines enable users to create, schedule, and orchestrate their workflows so that source data is automatically integrated, cleansed, and standardized.
An object storage solution can hold large amounts of structured, semi-structured, and unstructured data, which makes it perfect for staging source data before it's loaded into the warehouse.
A distributed storage solution holds large sets of data in relational tables with columnar storage. This greatly lowers costs, improves query performance, and speeds up time to insight.
To boost the performance of your applications, you may want to incorporate Apache Spark, an open-source parallel processing framework that supports in-memory processing.
Resource and workload management
A resource manager allocates computing power to your workloads so that you may load, analyze, manage, and export data accordingly.
Data modeling combines multiple data sources into a single semantic model, providing a structured, streamlined view of your data.
Business intelligence tools
Business analytics tools help deliver insights to users in the form of dashboards, reports, and other visualization tools.
Security and privacy features
Security and compliance features like data encryption, user authentication, and access monitoring ensure that your data stays protected.
What happened to Azure SQL Data Warehouse?
The capabilities associated with Azure SQL Data Warehouse are now a feature of Azure Synapse Analytics called dedicated SQL pool. Existing Azure SQL Data Warehouse customers can continue running their existing Azure SQL Data Warehouse workloads using the dedicated SQL pool feature in Azure Synapse Analytics without going through any changes. Customers can also start managing their existing warehouse data with Azure Synapse Analytics to take advantage of advanced analytics features like serverless data lake exploration and integrated SQL and Apache Spark™ engines.
Frequently asked questions
A data warehouse is a centralized repository that holds structured data (database tables, Excel sheets) and semi-structured data (XML files, webpages) for the purposes of reporting, analysis, and other forms of business intelligence.
There are many benefits to using a data warehouse. For instance, a data warehouse consolidates multiple sources of data into a single source of truth, which organizations can then use to make more informed decisions around business and operations.
Data warehouses store structured and semi-structured data, which can be used to source data mining, data visualization, and other specific BI use cases. Data lakes store various types of raw data, which data scientists can then use to source a variety of projects.
A data warehouse is typically composed of multiple tiers: the bottom tier, where data is collected and stored; the middle tier, where data is analyzed; and the top tier, where the data is displayed for users to access and parse through.
When designing and building data warehouse infrastructure, it's important to consider the nature of your data and how you'd like to transform it. Some common elements of a typical build-out include data sources, a staging area, the warehouse itself, data marts, sandboxes, and various integration tools.
Many major software companies now boast a wide range of data warehouse products.
These capabilities are now a feature of Azure Synapse Analytics called dedicated SQL pool. Existing Azure SQL Data Warehouse customers can continue running their workloads here without going through any changes.