This post was co-authored by Thakur Harsh Raj Singh, Senior SWE Lead, Enterprise Service BI.
Azure SQL Data Warehouse is a fully managed, SQL-based, petabyte-scale cloud solution for data warehousing. With tight integration with other Azure products, SQL Data Warehouse represents the backbone of any major cloud data warehousing solution in Azure. With decoupled compute and storage, elastic scale-out functionality, and the ability to pause and resume to meet demand, SQL Data Warehouse gives customers the flexibility, cost savings, and performance they need, right when they need it.
Microsoft Enterprise Services (ES) is one of the largest consulting and support businesses in the world, spanning more than 100 subsidiaries around the globe in spaces such as cloud productivity, mobility solutions, adoption services, and more. With the speed and scale at which clients' operations transform and grow, it is paramount that ES stays ahead of the curve to meet future demand.
The right data at the right time allows ES to make the best decisions about where to fund resources and how to best serve their customers. Traditional data warehouse reporting stacks took far too long to deliver reports and were far too inflexible to change models. With the costs of complexity, maintenance, and scaling to match growing data, their traditional on-premises data warehouse was producing less value than its upkeep and development cost, distracting from the core business value of delivering insights.
The case for moving to a modern data warehouse solution was readily apparent. ES analytics split their workload into data processing and data distribution. Core requirements for ES included easy scalability, high IO, multi-terabyte storage, row-level data security, and support for more than 200 concurrent users. Working through the host of Azure service offerings, ES landed on a solution with Azure SQL Data Warehouse handling the data processing and ad-hoc analysis workload and IaaS SQL Server 2016 Always On instances serving as the data distribution layer.
Implementation
At a high level, the ES implementation takes audit and historical data from a variety of on-premises sources, which first lands in Azure Storage Blobs. PolyBase is then used to quickly load that data in parallel into Azure SQL Data Warehouse, where it is processed and transformed into dimension and fact tables. Afterwards, these dimension and fact tables are moved into Analysis Services and SQL Server IaaS instances to support quick, highly concurrent access by a variety of business users. Across this solution, Azure Data Factory acts as the orchestrating ELT framework, providing a single interface to control the data flow for the majority of the pipeline.
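To make the load step concrete, here is a minimal sketch of a PolyBase load into SQL Data Warehouse. All object names (the storage account, the credential, the ext and stg schemas, and the AuditEvents table) are hypothetical, and the database-scoped credential is assumed to exist already:

```sql
-- Point an external data source at the blob container the on-premises extracts land in.
-- Assumes a database-scoped credential (BlobStorageCredential) has been created.
CREATE EXTERNAL DATA SOURCE AzureBlobStage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://staging@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- Describe the file layout once.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Expose the landed files as an external table.
CREATE EXTERNAL TABLE ext.AuditEvents (
    AuditEventId BIGINT,
    EngagementId INT,
    EventDate    DATE,
    EventPayload VARCHAR(4000)
)
WITH (
    LOCATION = '/audit/',
    DATA_SOURCE = AzureBlobStage,
    FILE_FORMAT = PipeDelimitedText
);

-- Load into the warehouse in parallel with CTAS.
CREATE TABLE stg.AuditEvents
WITH (DISTRIBUTION = HASH(EngagementId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.AuditEvents;
```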
8 Tips for SQL Data Warehouse
Putting together a performant solution like this is not a trivial task. Below is some guidance straight from the ES team on designing solutions with Azure SQL Data Warehouse:
1. Stage the data in Azure SQL DW:
One of the guiding principles of our warehouse has been to stage the data in its native form, i.e., the way it is available in the source. There are various reasons we persist a copy of the source data: performance, data quality, data persistence for validation, and so on. Because we stage the data ourselves, we can distribute it to fit our needs and ensure minimal data skew. Rarely, if ever, have we used round-robin distribution to store data.
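As a minimal sketch of that principle, the staging table below mirrors the source schema but is hash-distributed on the key that downstream joins will use, rather than relying on the ROUND_ROBIN default (table and key names are hypothetical):

```sql
-- Stage the source table as-is, but choose the distribution deliberately:
-- hash on the downstream join key instead of defaulting to ROUND_ROBIN.
CREATE TABLE stg.Engagement
WITH (DISTRIBUTION = HASH(EngagementId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.Engagement;  -- external (PolyBase) table over the raw extract
```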
2. Arriving at a common distribution key:
The data staged into the Azure SQL Data Warehouse instance was ingested from a 3NF data source. This allowed us to slightly change the schema from the source and include the base table's foreign key in all tables, which are then distributed on that same key. During our fact loads we join the tables on this set of keys, minimizing Data Movement Service (DMS) operations; in some cases no DMS operations occur at all, which gives us an edge in performance. As an example, the data in the source systems has one-to-many relationships between tables; in our SQL DW, however, we have inserted a common distribution key across all tables, based on business logic, and this key is populated when the ELT runs.
However, we would recommend checking the data skew before going ahead with this approach, as the distribution key must be chosen based on the skew it produces. When data skew is high or the joins are not distribution-compatible, we create an interim table distributed on the same key as the other join table. We use CTAS (CREATE TABLE AS SELECT) to accomplish this, which incurs one DMS operation to move all the rows but improves performance when there are complex joins, as sketched below.
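A sketch of both steps, with hypothetical table and key names — first inspecting skew, then using CTAS to re-distribute an interim copy:

```sql
-- Inspect how evenly the table's rows are spread across the 60 distributions.
DBCC PDW_SHOWSPACEUSED("dbo.FactEngagement");

-- If skew is high or a join is not distribution-compatible, create an interim
-- copy distributed on the other table's join key. This incurs one DMS move
-- up front but makes the subsequent complex joins local to each distribution.
CREATE TABLE dbo.FactEngagement_Interim
WITH (DISTRIBUTION = HASH(EngagementId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.FactEngagement;
```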
3. Vertical Partitioning of Wide Tables:
Azure SQL DW had a row size limitation of 32 KB. Since we had several wide tables with 150+ columns, many of them varchar(4000), we came up with an approach of vertically partitioning each table on the same distribution key. This overcame the 32 KB limit while still providing the required performance when joining the two halves, because the distribution key was the same.
Note: SQL Data Warehouse now supports rows up to 1 MB wide.
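A simplified sketch of the vertical partitioning pattern, with hypothetical tables: the narrow, frequently queried columns and the wide text columns live in separate tables that share the same distribution key, so re-joining them requires no data movement:

```sql
-- Narrow half: frequently used columns.
CREATE TABLE dbo.CaseCore
(
    CaseId       BIGINT,
    EngagementId INT,
    OpenedDate   DATE
    -- remaining narrow columns omitted
)
WITH (DISTRIBUTION = HASH(EngagementId), CLUSTERED COLUMNSTORE INDEX);

-- Wide half: the VARCHAR(4000) columns, on the same distribution key.
CREATE TABLE dbo.CaseText
(
    CaseId       BIGINT,
    EngagementId INT,
    Notes        VARCHAR(4000)
    -- remaining wide columns omitted
)
WITH (DISTRIBUTION = HASH(EngagementId), CLUSTERED COLUMNSTORE INDEX);

-- Distribution-compatible join: each node joins its own rows locally.
SELECT c.CaseId, c.OpenedDate, t.Notes
FROM dbo.CaseCore AS c
JOIN dbo.CaseText AS t
  ON c.EngagementId = t.EngagementId
 AND c.CaseId = t.CaseId;
```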
4. Use the right resource class:
In several cases we had complex facts that needed more resources (memory and CPU) to speed up the fact loads. Besides facts, we also had dimensions with complex business rules and Type 2 (slowly changing dimension) implementations. We designed our ELT so that less complex facts and dimensions run under the smallrc resource class, allowing for more parallelism, whereas the more complex facts that need more resources run under the largerc resource class.
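Resource classes in SQL Data Warehouse are assigned through database role membership, so routing a workload to a class is a matter of which user runs it. A minimal sketch with hypothetical user names:

```sql
-- The resource class of the executing user decides the memory grant and the
-- number of concurrency slots each of that user's queries takes.
EXEC sp_addrolemember 'largerc', 'ComplexFactLoader';  -- heavy fact loads: more memory
EXEC sp_droprolemember 'largerc', 'DimensionLoader';   -- light loads fall back to smallrc,
                                                       -- leaving room for more parallelism
```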
5. Use the primary key as distribution column for master tables:
In the source from which we ingest data into SQL DW, there are many master tables that we look up against when building our facts. We distribute those with a reasonable amount of data (more than 1 million rows) on the primary key, which is a unique integer. This gives us even data distribution (minimal to no data skew), making our lookup queries very fast.
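A minimal sketch of such a master table (names are hypothetical): with more than a million rows and a unique integer key, hashing on the primary key spreads rows almost perfectly evenly:

```sql
CREATE TABLE dbo.Customer
(
    CustomerId   INT NOT NULL,   -- unique integer primary key from the source
    CustomerName VARCHAR(200)
)
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);
```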
6. Using Dynamic Scale up and Scale down for saving costs:
Our ELTs in ADF are designed so that prior to the scheduled ELT kickoff, we scale our instance up from 100 DWU to 600 DWU, which has led to huge cost savings. Our ELT runs for nearly 4-5 hours, during which DWU usage is capped at 600. At month end, when processing is needed sooner and businesses need data faster, we have the option of scaling to 1000 DWU. All of this happens as part of our ELTs, with no manual intervention needed.
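Scaling is a single T-SQL statement run against the master database, which makes it easy for an ADF pipeline to wrap the ELT window between a scale-up and a scale-down. A sketch with a hypothetical warehouse name:

```sql
-- Before the ELT kicks off: scale up (use 'DW1000' for month-end processing).
ALTER DATABASE MyWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW600');

-- After the ELT completes: scale back down to save cost.
ALTER DATABASE MyWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW100');
```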
7. Regular Maintenance:
In our case, we included updating statistics and index rebuilds as part of the ELT process. As soon as the dimension and fact loads complete, we check all the clustered columnstore indexes (CCIs) and rebuild any whose fragmentation is greater than 5%. Similarly, for the key tables, we update statistics to ensure the best performance.
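The maintenance step itself is compact; a sketch against a hypothetical fact table:

```sql
-- Rebuild the clustered columnstore index once the load completes
-- (per the team's threshold, only when fragmentation exceeds 5%).
ALTER INDEX ALL ON dbo.FactEngagement REBUILD;

-- Refresh all statistics on the table so the optimizer sees the new data.
UPDATE STATISTICS dbo.FactEngagement;
```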
8. Leverage SQL Server 2016 for data marts:
Azure SQL DW is the primary data processing engine, while SQL Server 2016 running on DS14 IaaS VMs is our primary means of data distribution. This lets us leverage the high concurrency provided by SQL Server while using the power of Azure SQL DW for processing. Close to 1,000 users consume the data provisioned from Azure SQL DW.
Using Azure SQL Data Warehouse as part of their solution, Microsoft Enterprise Services was able to reduce run times by up to 40% and provide insights to business users with reduced latency.
If you have not already explored this fully managed, petabyte scale cloud data warehouse service, learn more at the links below.
Learn more
What is Azure SQL Data Warehouse?
SQL Data Warehouse best practices
Video library
MSDN forum
Stack Overflow forum