Create Windows-based Hadoop clusters in HDInsight
Learn how to plan for creating HDInsight clusters.
Before you begin the instructions in this article, you must have the following:
- An Azure subscription. See Get Azure free trial.
The following are the basic configuration options for creating an HDInsight cluster.
The cluster name identifies a cluster. It must follow these guidelines:
- The field must be a string that contains between 3 and 63 characters.
- The field can contain only letters, numbers, and hyphens.
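The two naming guidelines above can be checked before you submit a creation request. The following is a minimal sketch, not an official validator: the function name `is_valid_cluster_name` is hypothetical, and it assumes that "letters" means ASCII letters. The service may enforce additional rules that are not listed here.

```python
import re

# Encodes only the two guidelines stated above: 3 to 63 characters,
# drawn from letters, digits, and hyphens.
# Assumption: "letters" means ASCII letters (A-Z, a-z).
_CLUSTER_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{3,63}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the two listed guidelines."""
    return _CLUSTER_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_cluster_name("my-cluster-01")` passes, while a name that contains an underscore or has fewer than 3 characters is rejected.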
An HDInsight cluster is tied to one Azure subscription.
Resource group name
Azure Resource Manager (ARM) enables you to work with the resources in your application as a group, referred to as an Azure Resource Group. You can deploy, update, monitor or delete all of the resources for your application in a single, coordinated operation. For more information, see Azure Resource Manager Overview.
You can create HDInsight clusters on one of the following two operating systems:
- HDInsight on Windows (Windows Server 2012 R2 Datacenter):
- HDInsight on Linux (Ubuntu 12.04 LTS for Linux): HDInsight provides the option of configuring Linux clusters on Azure. Configure a Linux cluster if you are familiar with Linux or Unix, migrating from an existing Linux-based Hadoop solution, or want easy integration with Hadoop ecosystem components built for Linux. For more information, see Get started with Hadoop on Linux in HDInsight.
Cluster type and cluster size (a.k.a. data nodes)
HDInsight allows customers to deploy a variety of cluster types, for different data analytics workloads. Cluster types offered today are:
- Hadoop clusters: for query and analysis workloads
- HBase clusters: for NoSQL workloads
- Storm clusters: for real-time event processing workloads
- Spark clusters (preview): for in-memory processing, interactive queries, stream, and machine learning workloads
An Azure HDInsight cluster is also called a Hadoop cluster in HDInsight, or an HDInsight cluster. Sometimes the term is used interchangeably with Hadoop cluster. All of these terms refer to Hadoop clusters hosted in the Microsoft Azure environment.
Within a given cluster type, the nodes have different roles, which lets a customer size the nodes in each role to suit the workload. For example, a Hadoop cluster can have its worker nodes created with a large amount of memory if the type of analytics being performed is memory intensive.
Hadoop clusters for HDInsight are deployed with two roles:
- Head node (2 nodes)
- Data node (at least 1 node)
HBase clusters for HDInsight are deployed with three roles:
- Head servers (2 nodes)
- Region servers (at least 1 node)
- Master/Zookeeper nodes (3 nodes)
Storm clusters for HDInsight are deployed with three roles:
- Nimbus nodes (2 nodes)
- Supervisor servers (at least 1 node)
- Zookeeper nodes (3 nodes)
Spark clusters for HDInsight are deployed with three roles:
- Head node (2 nodes)
- Worker node (at least 1 node)
- Zookeeper nodes (3 nodes) (Free for A1 Zookeepers)
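Because the cluster size affects the price, it can help to work out how many nodes a given cluster type deploys in total. The following sketch encodes the fixed role counts listed above. The names `FIXED_NODES` and `total_nodes` are hypothetical, and the caller supplies the size of the scalable role (data, region, supervisor, or worker nodes).

```python
# Fixed node counts per role, copied from the role lists above.
# The scalable role (data, region, supervisor, or worker nodes)
# is supplied by the caller and must be at least 1.
FIXED_NODES = {
    "hadoop": {"head": 2},
    "hbase": {"head": 2, "zookeeper": 3},
    "storm": {"nimbus": 2, "zookeeper": 3},
    "spark": {"head": 2, "zookeeper": 3},
}


def total_nodes(cluster_type: str, scalable_nodes: int) -> int:
    """Return the total node count for a cluster type and scalable-role size."""
    if scalable_nodes < 1:
        raise ValueError("the scalable role needs at least 1 node")
    fixed = FIXED_NODES[cluster_type.lower()]
    return sum(fixed.values()) + scalable_nodes
```

For example, a Hadoop cluster with four data nodes has `total_nodes("hadoop", 4)` nodes, which is two head nodes plus four data nodes.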
Customers are billed for the usage of those nodes for the duration of the cluster’s life. Billing starts once a cluster is created and stops when the cluster is deleted (clusters can’t be de-allocated or put on hold). The cluster size affects the cluster price. For learning purposes, it is recommended to use 1 data node. For more information about HDInsight pricing, see HDInsight pricing.
The HDInsight version setting determines the version of HDInsight to use for this cluster. For more information, see Hadoop cluster versions and components in HDInsight.
The HDInsight cluster and its default storage account must be located in the same Azure location.
For a list of supported regions, click the Region drop-down list on HDInsight pricing.
Select the VM size for the nodes. For more information, see Sizes for Cloud Services. You can select the size of compute resources used by the cluster. For example, if you know that you will be performing operations that need a lot of memory, you may want to select a compute resource with more memory.
The nodes used by your cluster do not count as virtual machines, because the virtual machine images used for the nodes are an implementation detail of the HDInsight service. However, the compute cores used by the nodes do count against the total number of compute cores available to your subscription. You can see the number of cores that will be used by the cluster, as well as the number of cores available, in the summary section of the Node Pricing Tiers blade when creating an HDInsight cluster.
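The core accounting described above is simple arithmetic: multiply the node count of each role by the cores of its VM size, add the results, and compare the total with the cores available to the subscription. The sketch below illustrates this. The function names are hypothetical, and the cores-per-node values are inputs that you would look up for your chosen VM sizes; none are taken from this article.

```python
def cores_required(node_groups):
    """Sum the cores used by a cluster.

    node_groups: iterable of (node_count, cores_per_node) pairs,
    one pair per role. The core counts are caller-supplied inputs.
    """
    return sum(count * cores for count, cores in node_groups)


def fits_core_quota(node_groups, cores_available):
    """Return True if the cluster fits within the available core quota."""
    return cores_required(node_groups) <= cores_available
```

For example, with a hypothetical VM size of 4 cores, two head nodes and four data nodes give `cores_required([(2, 4), (4, 4)])`, which evaluates to 24 cores.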
Different cluster types have different node types, number of nodes, and node sizes. For example, a Hadoop cluster type has two head nodes and a default of four data nodes, while a Storm cluster type has two nimbus nodes, three zookeeper nodes, and a default of four supervisor nodes.
If you plan to use more than 32 worker nodes, either at cluster creation or by scaling the cluster after creation, you must select a head node size with at least 8 cores and 14 GB of RAM.
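The head node sizing rule above can be expressed as a small check. This is a sketch with hypothetical names. It encodes only the stated rule, so for 32 or fewer worker nodes it reports that the rule imposes no constraint.

```python
WORKER_THRESHOLD = 32   # worker-node count above which the rule applies
MIN_HEAD_CORES = 8      # minimum head node cores stated above
MIN_HEAD_RAM_GB = 14    # minimum head node RAM (GB) stated above


def head_node_is_sufficient(max_worker_nodes, head_cores, head_ram_gb):
    """Check the head node size against the more-than-32-workers rule.

    max_worker_nodes should be the largest worker count you plan to reach,
    including any scaling after creation.
    """
    if max_worker_nodes <= WORKER_THRESHOLD:
        return True  # the rule does not apply
    return head_cores >= MIN_HEAD_CORES and head_ram_gb >= MIN_HEAD_RAM_GB
```

Passing the planned maximum rather than the initial worker count matters, because the rule also applies when you scale the cluster after creation.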
When you use the Azure preview portal to configure the cluster, the node size is available through the Node Pricing Tier blade, which also displays the cost associated with the different node sizes.
Billing starts once a cluster is created, and only stops when the cluster is deleted. For more information on pricing, see HDInsight pricing details.
The HDInsight clusters allow you to configure two user accounts during provisioning:
- HTTP user: The default user name is admin when you use the basic configuration on the Azure Portal.
- RDP user (Windows clusters): It is used to connect to the cluster using RDP. When you create the account, you must set an expiration date that is within 90 days from today.
- SSH user (Linux clusters): It is used to connect to the cluster using SSH. You can create additional SSH user accounts after the cluster is created by following the steps in Use SSH with Linux-based Hadoop on HDInsight from Linux, Unix, or OS X.
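The 90-day limit on the RDP user's expiration date can be checked before provisioning. The sketch below uses a hypothetical function name and assumes that "within 90 days from today" means a date later than today and no more than 90 days ahead. Whether the service treats the 90th day as inclusive is an assumption.

```python
from datetime import date, timedelta

MAX_RDP_LIFETIME = timedelta(days=90)  # limit stated for the RDP user


def is_valid_rdp_expiry(expiry: date, today: date) -> bool:
    """Return True if expiry is after today and at most 90 days ahead.

    Assumption: the 90th day itself is accepted.
    """
    return today < expiry <= today + MAX_RDP_LIFETIME
```

Passing `today` explicitly, for example `date.today()`, keeps the check easy to test.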
Azure storage account
The original HDFS uses many local disks on the cluster. HDInsight uses Azure Blob storage instead for data storage. Azure Blob storage is a robust, general-purpose storage solution that integrates seamlessly with HDInsight. Through a Hadoop distributed file system (HDFS) interface, the full set of components in HDInsight can operate directly on structured or unstructured data in Blob storage. Storing data in Blob storage enables you to safely delete the HDInsight clusters that are used for computation without losing user data.
During configuration, you must specify an Azure storage account and an Azure Blob storage container on that account. Some creation processes require that the Azure storage account and the Blob storage container be created beforehand. The Blob storage container is used as the default storage location by the cluster. Optionally, you can specify additional Azure storage accounts (linked storage) that will be accessible by the cluster. The cluster can also access any Blob containers that are configured with full public read access or public read access for blobs only. For more information on restricting access, see Manage Access to Azure Storage Resources.
For more information on using secondary Blob stores, see Using Azure Blob Storage with HDInsight.
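Jobs on the cluster typically address data in Blob storage through the `wasb` URI scheme, which names the container and the storage account. The following sketch builds such a URI. The function name and the account, container, and path values in the example are hypothetical placeholders.

```python
def blob_storage_uri(account: str, container: str, path: str = "") -> str:
    """Build a wasb:// URI for a path in a Blob storage container.

    Format: wasb://<container>@<account>.blob.core.windows.net/<path>
    """
    return (
        f"wasb://{container}@{account}.blob.core.windows.net/"
        f"{path.lstrip('/')}"
    )
```

For example, `blob_storage_uri("mystorage", "mycontainer", "/example/data")` refers to the `example/data` path in the `mycontainer` container of the `mystorage` account.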
The metastore contains Hive and Oozie metadata, such as Hive tables, partitions, schemas, and columns. Using the metastore helps you to retain your Hive and Oozie metadata, so that you don't need to re-create Hive tables or Oozie jobs when you create a new cluster. By default, Hive uses an embedded Azure SQL database to store this information. The embedded database can't preserve the metadata when the cluster is deleted. For example, suppose you create a cluster with a Hive metastore and then create some Hive tables. After you delete the cluster and re-create it using the same Hive metastore, you can still see the Hive tables that you created in the original cluster.
Sometimes, you want to configure the configuration files:
Clusters can't retain the changes because of re-imaging. For more information, see Role Instance Restarts Due to OS Upgrades. To keep the changes through the clusters' lifetime, you can use HDInsight cluster customization during the creation process. This is the recommended way to change cluster configurations so that they persist across Azure re-image, reboot, and restart events. These configuration changes are applied before the services start, so services don't need to be restarted.
For a sample, see Customize HDInsight clusters using Bootstrap.
You can install additional components or customize cluster configuration by using scripts during creation. Such scripts are invoked via Script Action, which is a configuration option that can be used from the Portal, HDInsight Windows PowerShell cmdlets, or the HDInsight .NET SDK. For more information, see Customize HDInsight cluster using Script Action.
Azure Virtual Network allows you to create a secure, persistent network containing the resources you need for your solution. A virtual network allows you to:
- Connect cloud resources together in a private network (cloud-only).
- Connect your cloud resources to your local data-center network (site-to-site or point-to-site) by using a virtual private network (VPN).
  - Site-to-site configuration allows you to connect multiple resources from your data center to the Azure virtual network by using a hardware VPN or the Routing and Remote Access Service.
  - Point-to-site configuration allows you to connect a specific resource to the Azure virtual network by using a software VPN.
For information on using HDInsight with a Virtual Network, including specific configuration requirements for the Virtual Network, see Extend HDInsight capabilities by using an Azure Virtual Network.
In this article, you have learned basic information about creating a Windows-based HDInsight cluster. Use the table below to find specific information on how to create a cluster using a method that best suits your needs:
| Use this to create a cluster... | Using a web browser... | Using a command-line | Using the REST API | Using an SDK | From Linux, Mac OS X, or Unix | From Windows |