• 4 min read

Clustered Columnstore Index in Azure SQL Database

Columnstore index is the preferred technology to run analytics queries in Azure SQL Databases. We recently announced general availability  if In-Memory technologies for all Premium databases.…

Columnstore index is the preferred technology to run analytics queries in Azure SQL Databases. We recently announced general availability if In-Memory technologies for all Premium databases. Similar to In-Memory OLTP, the columnstore index technology is available in premium databases.

The columnstore technology is available in two flavors; clustered columnstore index (CCI) for DataMart analytics workloads and nonclustered columnstore index (NCCI) to run analytics queries on operational (i.e. OLTP) workload. Please refer to NCCI vs CCI for the differences between these two flavors of columnstore indexes. The columnstore index can speed up the performance of analytics queries up to 100x while significantly reducing the storage footprint. The data compression achieved depends on the schema and the data, but we see around 10x data compression on average when compared to rowstore with no compression. This blog will focus on Analytic workloads using CCI but cover NCCI in a future blog.

Clustered Columnstore index is available in Azure SQL Databases across all premium editions. However, it is not yet available on the Standard and Basic pricing tiers. Using this technology in Azure SQL Databases, you can lower the storage cost and getting a similar or better query performance on lower premium tiers.

The tables below show a typical analytics query with multi-table join running on P1 and P15 both with/without clustered columnstore index and storage savings achieved

Query Performance: Key point to note below is that with clustered columnstore index, the example query runs 5x faster on P1 compared to the same query running on P15 with rowstore with no tuning.  This can significantly lower the cost you need to pay to meet your workload requirements.

Pricing Tier With Rowstore With Columnstore Performance Gains
P1 30.6 secs 4.2 secs 14x
P15 19.5 secs 0.319 secs 60x

Storage Size: The storage savings with columnstore compared to PAGE or NONE compressed tables shown below. While the cost of storage is already included with AzureDB, but lower storage can enable you to choose a lower tier. Note, this is generated test data so the compression is lower than what one would get for customer workloads.

Number of Rows Size Rowstore (MB) Size columnstore (MB) Savings
3626191 212 (PAGE compression) 120 1.8x
3626191 756 (NONE compression) 120MB 6.2x

The best part of columnstore index technology is that it does not require any changes to your application. All you need to do is to either create or replace an existing index with columnstore index on your table(s).

How does Columnstore Index work?

As described earlier, the columnstore is just an index that stores data in a table as columns as shown below. The queries can continue to access the table requiring no changes.

chart 1

Columnstore index delivers significant data compression and query performance due to the following three key factors

  • Reduced IO and Storage: Since data is stored as individual columns, it compresses really well as all values are drawn from the same domain (i.e. data type) and in many cases, the values repeat or are similar. The compression will depend on the data distribution but typical compression that we have seen is around 10x. This is significant because it enables you reduce the storage as the IO footprint of your database significantly.
  • Only Referenced columns need to be fetched: Most analytics queries fetch/process only a small set of columns. If you consider a typical Star Schema,  the FACT table is the one with most rows and it has large number of columns. With columnstore storage, SQL Server needs to fetch only the referenced columns unlike rowstore where the full row needs to be fetched regardless of number of columns referenced in the query. For example, consider a FACT table with 100 columns and an analytic query accessing this tables references only 5 columns. Now, by fetching only the referenced columns, you can potentially reduce IO by 95% with simplifying assumption that all columns take same storage. Note, this is on top of already 10x data compression provided by columnstore.
  • Efficient Data Processing: SQL Server has an industry leading query engine for columnstore data to deliver up to 100x speed up in query performance. For details, please refer to Speeding up Analytics Queries.

How do I create clustered columnstore index?

Creating a clustered columnstore index is like creating any other index. For example, I can create a regular rowstore table as follows

CREATE TABLE ACCOUNT (
    ACCOUNTKEY                INT NOT NULL,
    ACCOUNTDESCRIPTION        NVARCHAR (50),
    ACCOUNTTYPE               NVARCHAR (50),
    ACCOUNTCODEALTERNATEKEY   INT)

Any rows inserted into the table above are stored in rowstore format. Now, if you want to convert this table to store data in ‘columnstore’, all you need to do is to execute the following SQL statement

CREATE CLUSTERED COLUMNSTORE index ACCOUNT_CI on ACCOUNT

If the rowstore table had a clustered BTREE index, then you can execute the following SQL Statement

CREATE CLUSTERED COLUMNSTORE index ACCOUNT_CI on ACCOUNT WITH (DROP_EXISTING = ON)

When and where should you use clustered columnstore Index?

Clustered Columnstore index primarily targets analytics workloads. The table below shows the common scenarios that have been successfully deployed with this technology.

Columnstore Option Workload Compression
CCI (clustered columnstore index)
  • Traditional DW workload with Star or Snowflake schema: Commonly you enable CCI on the FACT table but keep DIMENSION tables with rowstore with PAGE compression.
    Additional Considerations: consider CCI for large dimension tables with > 1 million rows
  • Insert mostly workload: Many workloads such as IOT (Internet of things) insert large volume of data with minimal updates/deletes. These workloads can benefit with huge data compression as well as speed up of analytic queries.

10x on average

CCI/NCI (with one or more nonclustered indexes)
  • Similar to the ones mentioned with CCI but require (a) PK/FK enforcements (b) significant number of queries with equality predicate or short range queries. NCIs speed up the query performance by avoiding full table scans (c) update/delete deletes of rows which can be efficiently located using NCIs.

10X on average + additional storage for NCIs

Resources to get started

For more details, please refer to the following