In a world where self-service and speed of delivery are key priorities in any reporting BI solution, the importance of creating a comprehensive and performant data semantic model layer is sometimes overlooked.
I have seen quite a few occurrences where the relational data store such as Azure SQL Database and Azure SQL Data Warehouse are well structured, the reporting tier is well presented whether that is SQL Server reporting services or Power BI, but still, the performance is not as expected.
Before we drill down to the data semantic model, I always advise that understanding your data and how you want to present and report on it is key. By creating a report that takes the end consumer through a data journey is the difference between a good and a bad report. Report designing should take into account who is consuming and what they want to achieve out of the report. For example, if you have a small number of consumers who need to view a lower level of hierarchical data with additional measures or KPIs, then it may not be suitable to visualize this on the first page. As the majority of consumers may want to view a more aggregated view of the data. This example could lead to the first page of the report taking longer to return the data, thus giving a perception of a slow running report to the majority of consumers. To achieve a better experience, we could take the consumers through a data journey to ensure that the detail level does not impact the higher-level data points.
Setting the correct service level agreements and performance expectation is key. Setting an SLA for less than two seconds may be achievable if the data is two million rows. But would this still be met if it was two billion rows? There are many factors in understanding what is achievable, and the data set size is one of them. Network, compute, architecture patterns, data hierarchies, measures, KPIs, consumer device, consumer location, and real-time vs batch reporting are other impacts that can affect the perception of performance to the end consumer.
However, creating and/or optimizing your data semantic model layer will have a drastic impact on the overall performance.
The question I sometimes get is, with more computing power and the use of Azure, why can I not just report directly from my data lake or operational SQL Server? The answer is that reporting from data is very different from writing and reading data in an online transaction processing (OLTP) approach.
Dimensional modeling developed by Kimball has now been a data warehouse proven methodology and widely used for the last 20 plus years. The ideology behind the dimensional modeling is to be able to generate interactive reporting where consumers can retrieve calculations, aggregate data, and show business KPIs.
Due to creating dimensional models within a star or snowflake schema, you have the ability to retrieve the data in a more performant way due to the schema being designed for retrieval and reads rather than reads and writes, which are commonly associated with an OLTP database design.
In a star or snowflake schema, you have a fact table and many dimension tables. The fact table contains the foreign keys relating them to the dimension tables along with metrics and KPIs. The dimension tables contain the attributes associated with that particular dimension table. An example of this could be a date dimension table that contains month, day, and year as the attributes.
Creating dimensional modeling for data warehousing such as SQL Server or Azure SQL Data Warehouse will assist in what you are trying to achieve out of a reporting solution. Again, traditional warehouses have been deployed widely over the last 20 years. Even with this approach, creating a data semantic model on top of the warehouse can improve performance as well as things like improving concurrency and even adding an additional layer of security between the end consumers and the source warehouse data. SQL Server Analysis Services and now Azure Analysis Services have been designed for this purpose. We can essentially serve the required data from the warehouse into a model that can then be consumed as below.
Traditionally, the architecture was exactly that. Ingest data from the warehouse into the cube sitting on an instance of Analysis Services, process the cube, and then serve to SQL Server Reporting Services, Excel, and more. The landscape of data ingestion has changed over the last five to ten years with the adoption of big data and the end consumers wanting to consume a wider range of data sources, whether that is SQL, Spark, CSV, JSON, or others.
Modern BI reporting now needs to ingest, mash, clean and then present this data. Therefore, the data semantic layer needs to be agile but also delivering on performance. However, the principles of designing a model that aligns to dimensional modeling are still key.
In a world of data services in Azure, Analysis Services and Power BI are good candidates for building data semantic models on top of a data warehousing dimensional modeling. The fundamental principles of these services have formed the foundations for Microsoft BI solutions historically, even though they have evolved and now use modern in-memory architecture and allow agility for self-service.
Power BI and Analysis Services Tabular Model
SQL Server Analysis Services Tabular model, Azure Analysis Services, and Power BI share the same underlining fundamentals and principles. They are all built using the tabular model which was first released on SQL Server in 2012.
SQL Server Analysis Services Multi Dimension is a different architecture and is set at the server configuration section at the install point.
The Analysis Service Tabular model (in Power BI) is built on the columnar in-memory architecture, which forms the VertiPaq engine.
At processing time, the rows of data are converted into columns, encoded, and compressed allowing more data to be stored. Due to the data being stored in memory the analytical reporting delivers high performance, versus retrieving data from disk-based systems. The purpose of in-memory tabular models is to minimize read times for reporting purposes. Again, understanding the consumer reporting behavior is key, tabular models are designed to retrieve a small number of columns. The balance here is that when you retrieve a high number of columns, the engine needs to sort back into rows of data and decompress, which impacts compute.
Best practices for data modeling
The best practices below are some of the key observations I have seen over the last several years, particularly when creating data semantic models in SQL Server Analysis Services, Azure Analysis Services, or Power BI.
- Create a dimension model star and/or snowflake, even if you are ingesting data from different sources.
- Ensure that you create integer surrogate keys on dimension tables. Natural keys are not best practice and can cause issues if you need to change them at a later date. Natural keys are generally strings, so larger in size and can perform poorly when joining to other tables. The key point in regards to performance with tabular models is that natural keys are not optimal for compression. The process with natural keys is that they are:
- Encoded, hash/dictionary encoding.
- Foreign keys encoded on the fact table relating to the dimension table, again hash/dictionary encoding.
- Build the relationships.
- This has an impact on performance and reduces the available memory for data as a proportion, which will be needed for the dictionary encoding.
- Only bring into the model the integer surrogate keys or value encoding and exclude any natural keys from the dimension tables.
- Only bring into the model the foreign keys or integer surrogate keys on the fact table from the dimension tables.
- Only bring columns into your model that are required for analysis, this may be excluding columns that are not needed or filter on data to only bring the data in that is being analyzed.
- Reduce cardinality so that the values uniqueness can be reduced, allowing for much greater compression.
- Add a date dimension into your model.
- Ideally, we should run calculations at the compute layer if possible.
The best practices noted above have all been used in part or collectively to improve the performance for the consumer experience. Once the data semantic models have been created to align with best practices, then performance expectations can be gauged and aligned with SLA’s. The key focus on the best practices above is to ensure that we utilize the VertiPaq in-memory architecture. A large part of this is to ensure that data can be compressed as much as possible so that we can store more data within the model but also so that we can report upon the data in an efficient way.
In my next blog, I will demonstrate how Microsoft Power BI is exposing more Analysis Services native functionality which makes it possible to consolidate the data semantic model.