Azure Virtual Machine Internals – Part 1

Gepost op 11 januari, 2017

Partner Group Engineering Manager

Introduction

The Azure cloud services are composed of elements from Compute, Storage, and Networking. The compute building block is a Virtual Machine (VM), which is the subject of discussion in this post. Web search will yield large amounts of documentation regarding the commands, APIs and UX for creating and managing VMs. This is not a 101 or ‘How to’ and the reader is for the most part expected to already be familiar with the topics of VM creation and management. The goal of this series is to look at what is happening under the covers as a VM goes thru its various states.

Azure provides IaaS and PaaS VMs; in this post when we refer to a VM we mean the IaaS VM. There are two control plane stacks in Azure, Azure Service Management (ASM) and Azure Resource Manager (ARM). We will be limiting ourselves to ARM since it is the forward looking control plane.

ARM exposes resources like VM, NIC but in reality ARM is a thin frontend layer and the resources themselves are exposed by lower level resource providers like Compute Resource Provider (CRP), Network Resource Provider (NRP) and Storage Resource Provider (SRP). Portal calls ARM which in turn calls the resource providers.

Getting Started

For most of the customers, their first experience creating a VM is in the Azure Portal. I did the same and created a VM of size ‘Standard DS1 v2’ in the West US region. I mostly stayed with the defaults that the UI presented but chose to add a ‘CustomScript’ extension. When prompted I provided a local file ‘Sample.ps’ as the PowerShell script for the ‘CustomScript’ extension. The PS script itself is a single line Get-Process.

The VM provisioned successfully but the overall ARM template deployment failed (bright red on my Portal dashboard). Couple clicks showed that the ‘CustomScript’ extension had failed and the Portal showed this message:

{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "DeploymentFailed",
        "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.",
        "details": [
          {
            "code": "Conflict",
            "message": "{\r\n  \"status\": \"Failed\",\r\n  \"error\": {\r\n    \"code\": \"ResourceDeploymentFailure\",\r\n    \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n    \"details\": [\r\n      {\r\n        \"code\": \"VMExtensionProvisioningError\",\r\n        \"message\": \"VM has reported a failure when processing extension 'CustomScriptExtension'. Error message: \\\"Finished executing command\\\".\"\r\n      }\r\n    ]\r\n  }\r\n}"
          }
        ]
      }
    ]
  }
}

It wasn’t immediately clear what had gone wrong. We can dig from here and as is often true, failures teach us more than successes.

I RDPed to the just provisioned VM. The logs for the VM Agent are in C:\WindowsAzure\Logs. The VM Agent is a system agent that runs in all IaaS VMs (customers can opt out if they would like). The VM Agent is necessary to run extensions. Let’s peek into the logs for the CustomScript Extension:

C:\WindowsAzure\Logs\Plugins\Microsoft.Compute.CustomScriptExtension\1.8\CustomScriptHandler

[1732+00000001] [08/14/2016 06:19:17.77] [INFO] Command execution task started. Awaiting completion...

[1732+00000001] [08/14/2016 06:19:18.80] [ERROR] Command execution finished. Command exited with code: -196608

The fact that the failure logs are cryptic hinted that something catastrophic had happened. So I re-looked at my input and realized that I had the file extension for the PS script wrong. I had it as Sample.ps when it should have been Sample.ps1. I updated the VM this time specifying the script file with the right extension. This succeeded as shown by more records appended to the log file mentioned above.

[3732+00000001] [08/14/2016 08:42:24.04] [INFO] HandlerSettings = ProtectedSettingsCertThumbprint: , ProtectedSettings: {}, PublicSettings: {FileUris: [https://iaasv2tempstorewestus.blob.core.windows.net/vmextensionstemporary-10033fff801becb5-20160814084130535/simple.ps1?sv=2015-04-05&sr=c&sig=M3qa7lS%2BZwp%2B8Tytqf1VEew4YaAKvvYn1yzGrPfSwyw%3D&se=2016-08-15T08%3A41%3A30Z&sp=rw], CommandToExecute: powershell -ExecutionPolicy Unrestricted -File simple.ps1 }

[3732+00000001] [08/14/2016 08:42:24.04] [INFO] Downloading files specified in configuration...

[3732+00000001] [08/14/2016 08:42:24.05] [INFO] DownloadFiles: fileUri = https://iaasv2tempstorewestus.blob.core.windows.net/vmextensionstemporary-10033fff801becb5-20160814084130535/simple.ps1?sv=2015-04-05&sr=c&sig=M3qa7lS+Zwp+8Tytqf1VEew4YaAKvvYn1yzGrPfSwyw=&se=2016-08-15T08:41:30Z&sp=rw

[3732+00000001] [08/14/2016 08:42:24.05] [INFO] DownloadFiles: Initializing CloudBlobClient with baseUri = https://iaasv2tempstorewestus.blob.core.windows.net/

[3732+00000001] [08/14/2016 08:42:24.22] [INFO] DownloadFiles: fileDownloadPath = Downloads\0

[3732+00000001] [08/14/2016 08:42:24.22] [INFO] DownloadFiles: asynchronously downloading file to fileDownloadLocation = Downloads\0\simple.ps1

[3732+00000001] [08/14/2016 08:42:24.24] [INFO] Waiting for all async file download tasks to complete...

[3732+00000001] [08/14/2016 08:42:24.29] [INFO] Files downloaded. Asynchronously executing command: 'powershell -ExecutionPolicy Unrestricted -File simple.ps1 '

[3732+00000001] [08/14/2016 08:42:24.29] [INFO] Command execution task started. Awaiting completion...

[3732+00000001] [08/14/2016 08:42:25.29] [INFO] Command execution finished. Command exited with code: 0

The CustomScript extension takes a script file which can be provided as a file in a Storage blob. Portal offers a convenience where it accepts a file from the local machine. I had provided Simple.ps1 which was in my \temp folder. Behind the scenes Portal uploads the file to a blob, generates a shared access signature (SAS) and passes it on to CRP. From the logs above you can see that URI.

This URI is worth understanding. It is a Storage blob SAS with the following attributes for an account in West US (which is the same region where my VM is deployed):

  • se=2016-08-15T08:41:30Z means that the SAS is valid until that time (UTC). Comparing it to the timestamp on the corresponding record in log (08/14/2016 08:42:24.05) it is clear that the SAS is being generated for a period of 24 hours.
  • Sr=c means that this is container level policy.
  • Sp=rw means that the access is for both read and write.
  • The shared access signature (SAS) has the full descriptions

I asserted above that this is a storage account in West US. That may be apparent from the naming of the storage account (iaasv2tempstorewestus) but is not a guarantee. So how can you verify that this storage account (or any other storage account) is in the region it claims to be in?

A simple nslookup on the blob DNS URL reveals this

C:\Users\yunusm>nslookup iaasv2tempstorewestus.blob.core.windows.net

Server: PK5001Z.PK5001Z

Address: 192.168.0.1

Non-authoritative answer:

Name: blob.by4prdstr03a.store.core.windows.net

Address: 40.78.112.72

Aliases: iaasv2tempstorewestus.blob.core.windows.net

The blob URL is a CNAME to a canonical DNS blob.by4prdstr03a.store.core.windows.net. Experimentation will show that more than one storage accounts maps to a single canonical DNS URL. The ‘by4’ in the name gives a hint to what region it is located. As per the Azure Regions page, the West US region is in California. Looking up the geo location of the IP address (40.78.112.72) indicates a more specific area within California.

Understanding the VM

Now that we have a healthy VM, let’s understand it more. As per the Azure VM Sizes page, this is the VM that I just created:

Size

CPU cores

Memory

NICs (Max)

Max. disk size

Max. data disks (1023 GB each)

Max. IOPS (500 per disk)

Max network bandwidth

Standard_D1_v2

1

3.5 GB

1

Temporary (SSD) =50 GB

2

2x500

moderate

This information can be fetched programmatically by doing a GET.

Returns this:

{

"name": "Standard_DS1_v2",

"numberOfCores": 1,

"osDiskSizeInMB": 1047552,

"resourceDiskSizeInMB": 7168,

"memoryInMB": 3584,

"maxDataDiskCount": 2

}

Doing a GET on the VM we created
Returns the following. Let’s understand this response in some detail. I have annotated inline comments preceded and followed by //
{
  "properties": {
    "vmId": "694733ec-46a0-4e0b-a73b-ee0863a0f12c",
    "hardwareProfile": {
      "vmSize": "Standard_DS1_v2"
    },
    "storageProfile": {
      "imageReference": {
        "publisher": "MicrosoftWindowsServer",
        "offer": "WindowsServer",
        "sku": "2012-R2-Datacenter",
        "version": "latest"

The interesting field here is the version. Publishers can have multiple versions of the same image at any point of time. Popular images are revved typically on a monthly frequency with the security patches. Major new versions are released as new SKUs. The Portal has defaulted me to the latest version. As a customer, I can chose to pick a specific version as well, whether I deploy thru Portal or thru an ARM template using the CLI or REST API; the latter being the preferred method for automated scenarios. The problem with specifying a particular version is that it can render the ARM template fragile. The deployment will break if the publisher unpublishes that specific version in one or more regions, as a publisher can do. So unless there is a good reason not to, the preferred value for the version setting is latest. As an example, the following images of the SKU 2012-R2-Datacenter are currently in the WestUS region, as returned by the CLI command azure vm image list.

MicrosoftWindowsServer  WindowsServer  2012-R2-Datacenter                      Windows  4.0.20151120     westus    MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter:4.0.20151120
MicrosoftWindowsServer  WindowsServer  2012-R2-Datacenter                      Windows  4.0.20151214     westus    MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter:4.0.20151214
MicrosoftWindowsServer  WindowsServer  2012-R2-Datacenter                      Windows  4.0.20160126     westus    MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter:4.0.20160126
MicrosoftWindowsServer  WindowsServer  2012-R2-Datacenter                      Windows  4.0.20160229     westus    MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter:4.0.20160229
MicrosoftWindowsServer  WindowsServer  2012-R2-Datacenter                      Windows  4.0.20160430     westus    MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter:4.0.20160430
MicrosoftWindowsServer  WindowsServer  2012-R2-Datacenter                      Windows  4.0.20160617     westus    MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter:4.0.20160617
MicrosoftWindowsServer  WindowsServer  2012-R2-Datacenter                      Windows  4.0.20160721     westus    MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter:4.0.20160721

      },
      "osDisk": {
        "osType": "Windows",
        "name": "BlogWindowsVM",
        "createOption": "FromImage",
        "vhd": {
          "uri": https://blogrgdisks562.blob.core.windows.net/vhds/BlogWindowsVM2016713231120.vhd

The OS disk is a page blob and starts out as a copy of the source image that the Publisher has published. Looking at the meta data of this blob and correlating it to what the VM itself has is instructive. Using the Cloud Explorer in Microsoft Visual Studio the blob’s property window:
1

This is a regular page blob that is functioning as an OS disk over the network. You will observe that the Last Modified date pretty much stays with NOW() most of the time – the reason being as long as the VM is running there are some flushes happening to the disk regularly. The size of the OS disk is 127 GB; the max allowed OS disk in Azure is 1 TB.

Azure Storage Explorer shows more properties for the same blob than the VS Cloud Explorer.

 

image

The interesting properties are the Lease properties. It shows the blob as leased with an infinite duration. Internally to VM creation, when a page blob is configured to be an OS/data disk for a VM, we take a lease on that blob before attaching it to the VM. This is so that the blob for a running VM is not deleted out of band. If you see a disk-backing blob has no lease while it shows as attached to a VM then that would be an inconsistent state and will need to be repaired.

RDPing in the VM itself, we can see two drives mounted and the OS drive is about the same size as the page blob in Storage. The pagefile is on D drive; so that faulted pages are fetched locally rather than over the network from Blob Storage. The temporary storage can be lost in case of events that case a VM to be relocated to a different node.

image

 },
  "caching": "ReadWrite"
},
"dataDisks": []

there are no data disks yet but we will add some soon

},
"osProfile": {
  "computerName": "BlogWindowsVM",

The name we chose for the VM in Portal is the hostname as well. The VM is DHCP enabled and gains its DIP address thru DHCP. The VM is registered in an internal DNS and has a generated FQDN.

C:\Users\yunusm>ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : BlogWindowsVM
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : qkqr4ajgme4etgyuajvm1sfy3h.dx.internal.cl
oudapp.net

Ethernet adapter:

   Connection-specific DNS Suffix  . : qkqr4ajgme4etgyuajvm1sfy3h.dx.internal.cl
oudapp.net
   Description . . . . . . . . . . . : Microsoft Hyper-V Network Adapter
   Physical Address. . . . . . . . . : 00-0D-3A-33-81-01
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::980c:bf29:b2de:8a05%12(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.1.0.4(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Lease Obtained. . . . . . . . . . : Saturday, August 13, 2016 11:14:58 PM
   Lease Expires . . . . . . . . . . : Wednesday, September 20, 2152 6:24:34 PM
   Default Gateway . . . . . . . . . : 10.1.0.1
   DHCP Server . . . . . . . . . . . : 168.63.129.16
   DHCPv6 IAID . . . . . . . . . . . : 301993274
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-1F-41-C4-70-00-0D-3A-33-81-01

   DNS Servers . . . . . . . . . . . : 168.63.129.16
   NetBIOS over Tcpip. . . . . . . . : Enabled

"adminUsername": "yunusm",
"windowsConfiguration": {
  "provisionVMAgent": true,

This is a hint to install a guest agent that does a bunch of config and runs the extensions. The guest agent binaries are here - C:\WindowsAzure\Packages

"enableAutomaticUpdates": true

Windows VMs by default are set to receive auto updates from Windows Update Service. There is a nuance to grasp here regarding availability and auto updates. If you have an Availability Set with multiple VMs with the purpose of getting high SLA against unexpected faults, then you do not want to have correlated actions (like Windows Updates) that can take down VMs across the Availability Set.

 

image

},

"secrets": []

},

"networkProfile": {

"networkInterfaces": [

{

"id": "/subscriptions/f028f547-f912-42b0-8892-89ea6eda4c5e/resourceGroups/BlogRG/providers/Microsoft.Network/networkInterfaces/blogwindowsvm91"

NIC is a standalone resource, we are not discussing networking resources yet.

    }
  ]
},
"diagnosticsProfile": {
  "bootDiagnostics": {
    "enabled": true,
    "storageUri": "https://blogrgdiag337.blob.core.windows.net/"
  }

Boot diagnostics have been enabled. Portal has a way of viewing the screenshot. You can get the URL for the screenshot from CLI:

C:\Program Files (x86)\Microsoft SDKs\Azure\CLI\bin>node azure vm get-serial-output

info: Executing command vm get-serial-output

Resource group name: blogrg

Virtual machine name: blogwindowsvm

+ Getting instance view of virtual machine "blogwindowsvm"

info: Console Screenshot Blob Uri:

https://blogrgdiag337.blob.core.windows.net/bootdiagnostics-blogwindo-694733ec-46a0-4e0b-a73b-ee0863a0f12c/BlogWindowsVM.694733ec-46a0-4e0b-a73b-ee0863a0f12c.screenshot.bmp

info: vm get-serial-output command OK

The boot screenshot can be viewed in Portal. However, the URL for the screenshot bmp file does not render in a browser.

What gives? It is due to the authentication on the storage account which blocks anonymous access. For any blob or container in Azure Storage it is possible to configure anonymous read access. Please do this with caution and only in cases where secrets will not be exposed. It is a useful capability for sharing data that is not confidential without having to generate SAS signatures. Once anonymous access is enabled on the container the screenshot renders in any browser outside of the portal.

image
    },
    "provisioningState": "Succeeded"
  },
  "resources": [
    {
      "properties": {
        "publisher": "Microsoft.Compute",
        "type": "CustomScriptExtension",
        "typeHandlerVersion": "1.7",
        "autoUpgradeMinorVersion": true,

It is usually safe for extensions to be auto updated on the minor version. There have been very few surprises in this regard though you have an option to not auto update.

        "settings": {
          "fileUris": [
            https://iaasv2tempstorewestus.blob.core.windows.net/vmextensionstemporary-10033fff801becb5-20160814084130535/simple.ps1?sv=2015-04-05&sr=c&sig=M3qa7lS%2BZwp%2B8Tytqf1VEew4YaAKvvYn1yzGrPfSwyw%3D&se=2016-08-15T08%3A41%3A30Z&sp=rw

As discussed earlier this is the SAS key for the powershell script. You will see this as a commonly used pattern to sharing files and data – upload to a blob, generate a SAS key and pass around.

          ],
          "commandToExecute": "powershell -ExecutionPolicy Unrestricted -File simple.ps1  "
        },
        "provisioningState": "Succeeded"
      },
      "id": "/subscriptions/f028f547-f912-42b0-8892-89ea6eda4c5e/resourceGroups/BlogRG/providers/Microsoft.Compute/virtualMachines/BlogWindowsVM/extensions/CustomScriptExtension",
      "name": "CustomScriptExtension",
      "type": "Microsoft.Compute/virtualMachines/extensions",
      "location": "westus"
    },
    {
      "properties": {
        "publisher": "Microsoft.Azure.Diagnostics",
        "type": "IaaSDiagnostics",
        "typeHandlerVersion": "1.5",
        "autoUpgradeMinorVersion": true,
        "settings": {
          "xmlCfg": <trimmed>,
          "StorageAccount": "blogrgdiag337"
        },
        "provisioningState": "Succeeded"
      },
      "id": "/subscriptions/f028f547-f912-42b0-8892¬-89ea6eda4c5e/resourceGroups/BlogRG/providers/Microsoft.Compute/virtualMachines/BlogWindowsVM/extensions/Microsoft.Insights.VMDiagnosticsSettings",
      "name": "Microsoft.Insights.VMDiagnosticsSettings",
      "type": "Microsoft.Compute/virtualMachines/extensions",
      "location": "westus"
    }
  ],
  "id": "/subscriptions/f028f547-f912-42b0-8892-89ea6eda4c5e/resourceGroups/BlogRG/providers/Microsoft.Compute/virtualMachines/BlogWindowsVM",
  "name": "BlogWindowsVM",
  "type": "Microsoft.Compute/virtualMachines",
  "location": "westus"
}

To Be Continued

We will carry on with what we can learn from a single VM and then move on to other topics.