• <1 minute

Creating Primary Keys Across Databases

[This article was contributed by the SQL Azure team.]When you horizontally partitioning data across multiple SQL Azure databases or using Data Sync Server for SQL Azure, there might come a time when…

[This article was contributed by the SQL Azure team.]

When you horizontally partitioning data across multiple SQL Azure databases or using Data Sync Server for SQL Azure, there might come a time when you need to write to a member database without causing primary key merge conflicts. In this case you need to be able to generate a primary key that is unique across all databases. In this article we will discuss different techniques to generate primary keys and their advantages and disadvantage.

UniqueIdentifier

One way to generate a unique primary keys is to use the NEWID() function in Transact-SQL, which generates a GUID as a uniqueidentifier data type. The GUID is guaranteed to be unique across all databases.

Advantages:

  • It is a native type to SQL Azure.
  • It is infinitely big and you will never run out of GUIDs
  • Works with both horizontal partitioning and Data Sync Services.

Disadvantage:

  • The disadvantages of using this technique is that based on the GUID; there is no way to identify what database generated it. This can cause extra complications when doing horizontal partitioning.
  • The uniqueidentifier data type is large and will add to the size of your row.

Bigint

Another option is to use a bigint data type in place of an int. In this technique, the primary key is generated from being an identity column; however each identity in each database starts at a different offset. Different offset create the non-conflicting primary keys.

The first question most people ask, is bigint data type big enough to represent all the primary keys need. The bigInt data type can be as large as 9,223,372,036,854,775,807 because it is stored in 8 bytes. This is 4,294,967,298 times bigger than the maximum size of an int data type: 2,147,483,647. This means that you could potentially have 4 billion SQL Azure databases horizontally partitioned with tables of around 2 billion rows. More information about data types and sizes can be found here.

On the first SQL Azure database you would create the table like this:

CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (1,1))

On the second SQL Azure database you would create the table like this:

CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (2147483648,1))

And continue incrementing the seed value for each database in the horizontal partitioning.

Advantages:

  • It is easier to upgrade from a legacy tables that used an int data type as the primary key to a bigint data type (the legacy table would be the first partition).
  • You can reparation easier than some of the other techniques, since moving rows involve a straight forward case statement (not a recalculated hash).
  • The data tier code implementing the partitioning can figure out which partition that the primary key is in, unlike a using a uniqueidentifier for a primary key.
  • The bigint data type consumes 8 bytes of space, which is smaller than the uniqueidentifier data type that take up 16 bytes.

Disadvantages:

  • The database schema for each partition is different.
  • This technique works well for horizontal partitioning, but not for Data Sync Service.

Primary Key Pool

In this technique a single identity database is built where all the primary keys are stored, however none of the data. This identity database just has a set of matching tables that contain a single column of integers (int data type) as an auto incrementing identity. When an insert is needed on any of the tables across the whole partition, the data tier code inserts into the identity database and fetches the @@IDENTITY. This primary key from the identity database is used as the primary key to insert into the member database or the partition. Because the identity database is generating the keys there is never a conflict.

So how many integers can a 50 Gigabyte SQL Azure database hold? This is a good question, because if you run out of space on your database acting as the primary key pool, then you can’t insert anymore rows. If all your tables where single column integers in the primary key database you could have 25,000 tables with two million rows (table size of 2 Megabytes) in a 50 Gigabyte SQL Azure database. 50 Gigabytes is currently the largest SQL Azure database you could use for your primary key database. Or some combination of that, like 12,000 tables of 4 million rows, or 6,000 tables of 8 million rows.

Advantages:

  • This is the easiest technique to implement with legacy tables; there are no data type changes. However, the IDENTITY attribute needs to be removed from the data tables.
  • Works with both horizontal partitioning and Data Sync Services.

Disadvantages:

  • This technique works best in low write high read scenarios where contention for the primary key database isn’t an issue.
  • Every INSERT requires an extra query to the primary database, which can causes performance issues, especially if the data database and the primary database are not in the same data center.
  • There is no way for the data tier layer to identify in what partition that the data is located in from the primary key. For this reason it works best with Data Sync Services, where you have a known member database you are reading and write too.
  • There is a constraint on the number of primary keys (int data type) you can hold in the largest 50 Giga byte SQL Azure database, which is a limitation on the number rows in your overall partition.

Double Column Primary Key

Another technique is to use two columns to represent the primary key. The first column is an integer that specifies the partition or the member database. The second column is an int IDENTITY, that auto increments. With multiple member or partition databases the second column would have conflicts, however together the two columns would create a unique primary key.

Here is an example of a CREATE TABLE statement with a double column primary key:

CREATE TABLE TEST ([partition] int,       [id] int IDENTITY,       CONSTRAINT PK_TEST           PRIMARY KEY([partition], [id]));

Remember you need to add a second column for all the primary keys, and a second column to all foreign key references.

Advantages:

  • It is easier to upgrade from a legacy tables by adding an additional column then it is to convert those tables to uniqueidentifier primary keys.
  • Two integers consume 8 bytes of space, which is smaller than the uniqueidentifier data type that take up 16 bytes.
  • Works with both horizontal partitioning and Data Sync Services.
  • The data tier code implementing the partitioning can figure out which partition that the primary key is in, unlike a using a uniqueidentifier for a primary key.

Disadvantages:

  • It feels unnatural and cumbersome to have two columns be your primary key.
  • The data tier code needs to keep track of two integers to access the primary key. This can be obfuscated by using language elements to create a new data type, like a custom struct in C#.

Summary

Do you have a favorite technique for solving the same problem; share it with us in the comments below. Or give us your opinion about what technique is the best. Have questions, concerns, comments? Post them below and we will try to address them.