Comparing SELECT..INTO and CTAS use cases in Azure SQL Data Warehouse

Inlägg på 6 mars, 2017

Principal Program Manager

The team recently introduced SELECT..INTO to the SQL language of Azure SQL Data Warehouse. SELECT..INTO enables you to create and populate a new table based on the result-set of a SELECT statement. So now users have two options for creating and populating a table using a single statement. This post summarises the usage scenarios for both CTAS and SELECT..INTO and summarizes the differences between the two approaches:

Look at the example of SELECT..INTO below:

SELECT *

INTO [dbo].[FactInternetSales_new]

FROM [dbo].[FactInternetSales]

;

The result of this query is also a new round robin distributed clustered columnstore table called dbo.FactInternetSales_new. All done and dusted in three lines of code. Great!

Let’s now contrast this with the corresponding CTAS statement below:

CREATE TABLE [dbo].[FactInternetSales_new]

WITH

( DISTRIBUTION = HASH(Product_key)

, HEAP

)

AS

SELECT *

FROM [dbo].[FactInternetSales]

;

The result of this query is a new hash distributed heap table called dbo.FactInternetSales_new. Note that with CTAS you have full control of the distribution key and the organisation of the table. However, the code is more verbose as a result. With SELECT..INTO that code is significantly reduced and also might be more familiar.

With that said there are some important differences to be mindful of when using SELECT..INTO. There are no options to control the table organization or the distribution method. SELECT..INTO also always creates a round robin distributed clustered columnstore table. It is also worth noting that there is a small difference in behavior when compared with SQL Server and SQL Database. In SQL Server and SQL Database the SELECT..INTO command creates a heap table (the default table creation structure). However, in SQL Data Warehouse, the default table type is a clustered columnstore and so we follow the pattern of creating the default table type.

Below is a summary table of the differences between CTAS and SELECT..INTO:

  CTAS SELECT INTO
Distribution Key Any (full control) ROUND_ROBIN
Table type Any (full control) CLUSTERED COLUMNSTORE INDEX
Verbosity Higher (WITH section required) Lower (defaults fixed so no additional coding)
Familiarity Lower (newer syntax to Microsoft customers) Higher (very familiar syntax to Microsoft customers)

 

Despite these slight differences and variations there still several reasons for including SELECT..INTO in your code.

In my mind there are three primary reasons:

  1. Large code migration projects
  2. Target object is a round robin clustered columnstore index
  3. Simple cloning of a table.

When customers migrate to SQL Data Warehouse they are often times migrating existing solutions to the platform. In these cases the first order of business is to get the existing solution up and running on SQL Data Warehouse. In this case SELECT..INTO may well be good enough. The second scenario is the compact code scenario. Here a round_robin clustered columnstore table may be the desired option. In which case SELECT..INTO is much more compact syntactically. SELECT..INTO can also be used to create simple sandbox tables that mirror the definition of the source table. Even empty tables can created when paired with a WHERE 1=2 is used to ensure no rows are moved. This is a useful technique for creating empty tables when implementing partition switching patterns.

Finally, customers may not even realize they require SELECT..INTO support. Many customers use off the shelf ISV solutions that require support for SELECT..INTO. A good example might be a rollup Business Intelligence tool that generates its own summary tables using SELECT..INTO on the fly. In this case customers may be issuing SELECT..INTO queries without even realizing it.

For more information please refer to the product documentation for CTAS where the main differences are captured.