Trace Id is missing
Skip to main content
Four people discussing face to face with presentation on laptop

What is a data warehouse?

Learn what a data warehouse is, the benefits of using one, best practices to consider during the design phase, and which tools to incorporate when it's finally time to build.

What is a data warehouse?

First, let's define what a data warehouse is and why you might want to use one for your organization.

A data warehouse is a centralized repository that stores structured data (database tables, Excel sheets) and semi-structured data (XML files, webpages) for the purposes of reporting and analysis. The data flows in from a variety of sources, such as point-of-sale systems, business applications, and relational databases, and it is usually cleaned and standardized before it hits the warehouse. Because a data warehouse can store large amounts of information, it provides users with easy access to a wealth of historical data, which can be used for data mining, data visualization, and other forms of business intelligence reporting.

Two persons are checking report data in a table

Benefits of data warehouse

Reliable data, especially when aggregated over time, helps users make smarter, more informed decisions about the way they run their organization—and data warehouses are what makes that possible. The benefits of enterprise data warehousing are myriad, but some of the most impactful advantages include:

Data warehouse vs. data lake

It's clear that data warehouses are essential to any organization's analytics operations. But what's the difference between a data warehouse and other types of data repositories, such as a data lake? And when should one be used over the other?

As repositories, data warehouses and data lakes both store and process data. Yet though they may seem to offer the same functionality, they each have their own particular use cases. This is why organizations commonly incorporate both systems to form a complete, end-to-end solution that can handle a wide range of purposes.

A data warehouse is relational in nature. This means that the structure or schema of the data is determined by predefined business and product requirements that are curated, conformed, and optimized for SQL query operations. As a result, data warehouses are best used for storing data that has been treated with a specific purpose in mind, such as data mining for BI analysis, or for sourcing a business use case that has already been identified.

Like data warehouses, data lakes hold structured and semi-structured data. Yet they are also capable of accommodating raw and unprocessed data from a variety of non-relational sources, including mobile apps, IoT devices, social media, or streaming. This is because structure or schema in a data lake isn't defined until the data is read. As a result of their flexible, scalable nature, data lakes are often used for performing intelligent forms of data analysis, such as machine learning.

organization-wide support plan
Data lake Data warehouse
Type

Structured, semi-structured, unstructured
Relational, non-relational

Structured
Relational

Schema

Schema on read

Schema on write

Format

Raw, unfiltered

Processed, vetted

Sources

Big data, IoT, social media, streaming data

Application, business, transactional data, batch reporting

Scalability

Easy to scale at a low cost

Difficult and expensive to scale

Users

Data scientists, data engineers

Data warehouse professionals, business analysts

Use cases

Machine learning, predictive analytics, real-time analytics

Core reporting, BI

A person is sitting on a chair and working on desktop

Data warehouse architecture and design

Now that you know why and when you should use a data warehouse, let's dive into how one works by looking at data warehouse design. A data warehouse is more than just a single silo operating on its own. Rather, it is a highly structured, carefully architected system composed of multiple tiers that interact with your data—and each other—in different ways. Typically, these tiers include:

The bottom tier

Data is ingested from multiple sources, then cleansed and transformed for other applications to use in a process called extract, transform, and load (ETL). The bottom tier is also where data is stored and optimized, which leads to faster query times and better performance overall.

Middle tier

This is where you'll find the analytics engine, also known as the online analytical processing (OLAP) server. OLAP servers access large volumes of data from the data warehouse at a high speed, which leads to lightning-fast results.

Top tier

The top tier is where the front-end interface visually presents the processed data, which analysts may access and use for all their reporting and self-service BI needs.

How to build a data warehouse

When designing and building a data warehouse, it's important to consider the goals of your organization, both long-term and ad-hoc, as well as the nature of your data. How many data sources are you integrating? Do you plan on automating your workflows? How will you explore and analyze your data? Your build-out will vary depending on the complexity of your needs, but a typical enterprise database warehouse may consist of the following components:

  1. Data sources that extract operational data from point-of-sale systems, business applications, and other relational databases
  2. staging area where data is cleaned and transformed for the warehouse or centralized repository
  3. warehouse or centralized repository which stores processed operational data, metadata, summary data, and raw data for easy user access
  4. The addition of data marts, which takes data from the centralized repository and serves it in subsets to selected groups of users
  5. sandbox, which data scientists may use to test new forms of data exploration in a protected environment
  6. A wide variety of data warehousing tools, frameworks, and APIs for integration, storage, performance, and analysis
A person analysing charts on the laptop and data reports on two montiors
Laptop screen with charts open

Data warehouse tools, software, and resources

In today's data-centric world, plenty of major software companies boast a seemingly endless range of data warehouse software, each with its own specific use case. It may seem daunting, but in order to build a cohesive, high-performance solution, you'll want to invest in the right tools and technologies. Every organization's needs are different, but here are some essential data warehouse products to look into:

Cloud and hybrid cloud data warehousing

A unified, cloud-based data warehousing solution, such as Azure Synapse Analytics, gives organizations the ability to scale, compute, and store at a faster speed and lower cost.

Data integration tools

ETL pipelines enable users to create, schedule, and orchestrate their workflows so that source data is automatically integrated, cleansed, and standardized.

Object storage

An object storage solution can hold large amounts of structured, semi-structured, and unstructured data, which makes it perfect for staging source data before it's loaded into the warehouse.

Warehousing tools

A distributed storage solution holds large sets of data in relational tables with columnar storage. This greatly lowers costs, improves query performance, and speeds up time to insight.

Performance tools

To boost the performance of your applications, you may want to incorporate Apache Spark, an open-source parallel processing framework that supports in-memory processing.

Resource and workload management

A resource manager allocates computing power to your workloads so that you may load, analyze, manage, and export data accordingly.

Data modeling

Data modeling combines multiple data sources into a single semantic model, providing a structured, streamlined view of your data.

Business intelligence tools

Business analytics tools help deliver insights to users in the form of dashboards, reports, and other visualization tools.

Security and privacy features

Security and compliance features like data encryption, user authentication, and access monitoring ensure that your data stays protected.

Two person discussing and person in the right is holding laptop

What happened to Azure SQL Data Warehouse?

The capabilities associated with Azure SQL Data Warehouse are now a feature of Azure Synapse Analytics called dedicated SQL pool. Existing Azure SQL Data Warehouse customers can continue running their existing Azure SQL Data Warehouse workloads using the dedicated SQL pool feature in Azure Synapse Analytics without going through any changes. Customers can also start managing their existing warehouse data with Azure Synapse Analytics to take advantage of advanced analytics features like serverless data lake exploration and integrated SQL and Apache Spark™ engines.

Frequently asked questions

  • A data warehouse is a centralized repository that holds structured data (database tables, Excel sheets) and semi-structured data (XML files, webpages) for the purposes of reporting, analysis, and other forms of business intelligence.

  • There are many benefits to using a data warehouse. For instance, a data warehouse consolidates multiple sources of data into a single source of truth, which organizations can then use to make more informed decisions around business and operations.

  • Data warehouses store structured and semi-structured data, which can be used to source data mining, data visualization, and other specific BI use cases. Data lakes store various types of raw data, which data scientists can then use to source a variety of projects.

  • A data warehouse is typically composed of multiple tiers: the bottom tier, where data is collected and stored; the middle tier, where data is analyzed; and the top tier, where the data is displayed for users to access and parse through.

  • When designing and building data warehouse infrastructure, it's important to consider the nature of your data and how you'd like to transform it. Some common elements of a typical build-out include data sources, a staging area, the warehouse itself, data marts, sandboxes, and various integration tools.

  • Many major software companies now boast a wide range of data warehouse products.

  • These capabilities are now a feature of Azure Synapse Analytics called dedicated SQL pool. Existing Azure SQL Data Warehouse customers can continue running their workloads here without going through any changes.

Free Account

Try Azure Cloud Computing services free for up to 30 days.

Pay as you go

Get started with pay-as-you-go pricing. There's no upfront commitment—cancel anytime.