Editor's Note: This post was written by Arvind Ranasaria and Khalid Mouss.
As part of certifying the SQL Server product on the Windows Azure Virtual Machines service, the SQL Server engineering team ran hundreds of test suites on Windows Azure Virtual Machines over a period of several weeks. In parallel, one of the key ticketing applications used for lab and infrastructure management was being migrated from on-premises servers to Windows Azure Virtual Machines. Both efforts yielded key lessons and best practices that were adopted along the way. SQL Server is a significant workload that runs on Windows Azure Virtual Machines. In this blog post, I would like to summarize some of those lessons and describe what has proved to work:
Provisioning/De-provisioning VMs at scale
During SQL Server certification on Windows Azure Virtual Machines, our labs commonly need to provision dozens, and sometimes more than 150, SQL Server and Windows VMs to complete a full test run. This provisioning is done through custom automation tooling. The most common scenario is to provision a number of these VMs, run a set of tests and workloads on them, tear the VMs down once the run completes, and then re-create them for the next run, and so on - in most cases reusing the same VM names. Many of these VMs are batched into separate cloud services because of the current limit of 50 VMs per cloud service on Windows Azure.
After several rounds, we added a retry mechanism, because the overall success rate could drop to around 50%, especially when batching more than 10 VMs at a time. It is also worth noting that most of these operations were done against a single storage account. If your scenario calls for many VMs to be provisioned and de-provisioned at scale, we suggest the following:
- Catch HTTP 409 (Conflict) errors and have a retry mechanism in place.
- Spread deployments across multiple storage accounts (note: there is a default limit of 10 storage accounts per subscription, which can be increased through a support ticket).
- If you see a high failure rate, reduce the number of parallel provision/de-provision calls; this helps reduce lock conflicts from multiple operations targeting the same storage account. Also reduce batch sizes to fewer than 10, and preferably no more than 5, VMs at a time.
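The retry and batching guidance above can be sketched as follows. This is a minimal illustration, not our actual tooling: `create_vm` stands in for whatever provisioning call your automation makes, and `ConflictError` is a hypothetical exception representing an HTTP 409 response from the service.

```python
import time

class ConflictError(Exception):
    """Stand-in for an HTTP 409 (Conflict) error from the provisioning API."""
    status_code = 409

def provision_with_retry(create_vm, vm_name, retries=5, delay=30):
    """Call create_vm(vm_name), retrying on 409 conflicts with a growing delay."""
    for attempt in range(1, retries + 1):
        try:
            return create_vm(vm_name)
        except ConflictError:
            if attempt == retries:
                raise
            time.sleep(delay * attempt)  # back off before the next try

def batches(vm_names, storage_accounts, batch_size=5):
    """Split VMs into batches of <= batch_size and round-robin the batches
    across storage accounts, so no single account takes all the load."""
    for i in range(0, len(vm_names), batch_size):
        batch = vm_names[i:i + batch_size]
        account = storage_accounts[(i // batch_size) % len(storage_accounts)]
        yield account, batch
```

With `batch_size=5` and two storage accounts, twelve VMs would be provisioned as three batches alternating between the accounts, each batch retried independently on 409s.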
Windows Azure is making ongoing improvements in this area; stay tuned for additional results and recommendations over the next few weeks and months.
Known limitation: 50 VMs provisioned per cloud service at any one time.
Windows Update vs. Microsoft Update
- SQL Server gallery images have Windows Update turned on by default. The preset default is daily at 3:00 AM for that region.
- To get SQL Server updated automatically, ensure Microsoft Update is also turned on - details here. Microsoft Update may be turned on by default in a future release.
- Some Windows updates require a restart, so it is highly recommended to place VMs in an availability set and configure a different Windows Update time on each replica.
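One simple way to follow the last bullet is to assign each replica in the availability set a distinct scheduled-install hour, so an update-triggered restart never takes down more than one replica at a time. A minimal sketch (the replica names are hypothetical):

```python
def staggered_update_hours(replicas, start_hour=3, spacing_hours=2):
    """Assign each replica its own maintenance hour, wrapping at 24,
    so scheduled update restarts never coincide across the set."""
    return {name: (start_hour + i * spacing_hours) % 24
            for i, name in enumerate(replicas)}

schedule = staggered_update_hours(["sqlvm1", "sqlvm2", "sqlvm3"])
# schedule == {"sqlvm1": 3, "sqlvm2": 5, "sqlvm3": 7}
```

You would then apply each computed hour to the corresponding VM's Windows Update schedule.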
Regenerate Service Master Key (SMK)
Have a key rotation policy in place to ensure the same Service Master Key is not used across all deployments.
Recommendation 1: Run ALTER SERVICE MASTER KEY REGENERATE; GO to generate a new key.
Recommendation 2: This is a resource-intensive operation; run it during a period of low demand.
More details here. This helps ensure a more secure SQL Server on Azure and that the same master key is not reused.
Subscription and Resource management
If you are sharing an Azure subscription among multiple teams and expect a large number of resources to be managed across the organization, it is highly recommended to do some resource planning and provide process guidance for your organization before opening up the subscription. This can save you a lot of time down the road when you need to keep subscription usage healthy, avoid accumulating a large number of unused resources, and avoid paying for unneeded capacity.
For more details and some recommendations on this topic, see here.
VNet Configuration Issues
During our lift-and-shift of the on-premises ticketing application to the Windows Azure Virtual Machines service, we ran into several issues with VNet configuration.
- The web-tier machines and the SQL Server-tier machines were on two different VNets. Each of these VNets was connected back to the on-premises network using a VPN device and Windows Azure Virtual Network features. This was done to allow greater bandwidth for communication with the on-premises network. However, it caused the web-tier machines to reach the SQL-tier machines via round trips through the on-premises network instead of direct VM-to-VM communication within the Windows Azure network.
- The ticketing application in the Windows Azure Virtual Machines environment needed to communicate with an on-premises server, say Server1, to move tickets from one ticketing system to another. When configuring VNets that are connected to an on-premises network, you must specify the list of on-premises subnets that are reachable by VMs in the Windows Azure VNet. The subnet hosting Server1 was missing from this list, so the application was not fully functional. The resolution required reconfiguring the VNet and all VMs deployed to it.
Both of the above issues required redeployment of the VMs and the VNets. Unfortunately, changes to VNets cannot currently be made online. So the recommendation is to plan VNet layouts carefully and to ensure that all on-premises subnets that need to be reachable from the VMs in Windows Azure are included in the VNet configuration.
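A cheap planning check for the second issue is to verify, before deploying, that every on-premises host the application must reach falls inside one of the subnets declared for the local network site in the VNet configuration. A minimal sketch, with hypothetical address ranges standing in for your actual network plan:

```python
import ipaddress

def unreachable_hosts(required_hosts, declared_subnets):
    """Return the on-premises host IPs not covered by any declared subnet."""
    nets = [ipaddress.ip_network(s) for s in declared_subnets]
    return [h for h in required_hosts
            if not any(ipaddress.ip_address(h) in n for n in nets)]

# Hosts the ticketing app must reach on premises (Server1 is 10.2.5.10 here):
required = ["10.1.0.25", "10.2.5.10"]
declared = ["10.1.0.0/16"]  # Server1's subnet 10.2.5.0/24 was omitted
missing = unreachable_hosts(required, declared)
# missing == ["10.2.5.10"] -> add 10.2.5.0/24 to the local network site
# before deploying, since fixing it afterwards means reconfiguring the
# VNet and every VM in it.
```

Running a check like this against the planned configuration would have caught the missing Server1 subnet before any VMs were deployed.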
System disk backup/restore using the Copy Blob API
If your deployment requires several VMs to run SQL Server in an HA environment alongside SharePoint, a domain controller, web front ends, app servers, and so on, backing up those machines after they are deployed is critical to maintaining the production environment. The Copy Blob API can be used to back up the system disk VHDs of these VMs.
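A Copy Blob backup boils down to one PUT request against the destination blob with an `x-ms-copy-source` header naming the source VHD; the storage service then performs the copy server-side. The sketch below only builds the request; the account names and SAS query strings are hypothetical placeholders, and in practice you would authorize with SAS URLs or a signed request.

```python
import urllib.request

def build_copy_blob_request(source_vhd_url, dest_blob_url):
    """Build a PUT request asking the storage service to copy
    source_vhd_url into dest_blob_url (the copy runs server-side)."""
    req = urllib.request.Request(dest_blob_url, method="PUT")
    # x-ms-copy-source tells the service which blob to copy from.
    req.add_header("x-ms-copy-source", source_vhd_url)
    return req

# Hypothetical SAS-authorized URLs for a SQL VM's OS disk and its backup:
src = "https://prodacct.blob.core.windows.net/vhds/sqlvm1-os.vhd?sig=..."
dst = "https://backupacct.blob.core.windows.net/backups/sqlvm1-os.vhd?sig=..."
request = build_copy_blob_request(src, dst)
# urllib.request.urlopen(request) would start the copy; the service reports
# progress via the destination blob's copy-status properties.
```

Copying to a different storage account, as shown, also keeps the backups out of the production account's I/O path.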
VM Warm-up Phase
When a VM first comes up on a new host machine, it takes some time for the local VM cache to be built up. During this phase, the VM's perceived performance may be very low and the UI experience will be very slow until the VM is fully warmed up. The recommended mitigation is to warm up the VM by performing, once, the typical functions the VM will be doing in production.
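A minimal sketch of that warm-up idea: after the VM boots, read through the same data the production workload will touch so the blocks are already cached when real traffic arrives. The example file paths in the comment are hypothetical stand-ins for your actual workload's files (for a SQL Server VM, the database and log files it will serve).

```python
def warm_file_cache(paths, chunk_size=1 << 20):
    """Sequentially read each file once so its blocks land in the local
    VM cache. Returns the total number of bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total

# e.g. warm the data files a SQL Server VM will serve (hypothetical paths):
# warm_file_cache([r"F:\Data\tickets.mdf", r"F:\Logs\tickets.ldf"])
```

Beyond file reads, the same principle applies to whatever the VM does in production: run a representative query or exercise the application's hot code paths once before putting the VM into rotation.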
Packet Reordering issue on some VPN devices
Some VPN devices have a security setting that performs TCP packet reordering by default. If the on-premises network has IPsec configured, this setting can cause issues; TCP packet reordering is not needed in that case and should be turned off.