How many data nodes do I need for my HDInsight cluster?
The number of data nodes will vary depending on your needs. With the elasticity available in Azure cloud services, you can try a variety of cluster sizes to determine your own optimal mix of performance and cost, and only pay for what you use at any given time. Clusters can also be scaled on demand to grow and shrink to match the requirements of your workload.
Related questions and answers
Each subscription has a default limit on how many HDInsight data nodes can be created. If you need to create a larger HDInsight cluster, or multiple HDInsight clusters that together exceed your current subscription maximum, you can request an increase to your subscription’s billing limits. To do so, open a support ticket with Support Type = Billing. Depending on the maximum nodes per subscription that you request, you may be asked for additional information that will allow us to optimize your deployment(s).
To estimate the cost of clusters of various sizes, try the Azure Calculator.
We charge for the number of minutes your cluster is running: usage is rounded to the nearest minute, not to the nearest hour.
HDInsight deploys a different number of nodes for each cluster type. Within a given cluster type, nodes are assigned to different roles, so you can size the nodes in each role to suit your workload. For example, a Hadoop cluster can have its worker nodes provisioned with a large amount of memory if the analytics being performed are memory intensive.
Hadoop clusters for HDInsight are deployed with two roles:
- Head node (2 nodes)
- Data node (at least 1 node)
HBase clusters for HDInsight are deployed with three roles:
- Head servers (2 nodes)
- Region servers (at least 1 node)
- Master/Zookeeper nodes (3 nodes)
Storm clusters for HDInsight are deployed with three roles:
- Nimbus nodes (2 nodes)
- Supervisor servers (at least 1 node)
- Zookeeper nodes (3 nodes)
Spark clusters for HDInsight are deployed with three roles:
- Head node (2 nodes)
- Worker node (at least 1 node)
- Zookeeper nodes (3 nodes) (free for A1 zookeepers)
Using R-Server adds one edge node on top of the cluster deployment architecture described above.
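The role lists above can be turned into a quick node count for planning. The following is a minimal sketch, not an official tool: the names `FIXED_NODES` and `total_nodes` are hypothetical, and the fixed counts are simply copied from the lists above.

```python
# Fixed (non-scalable) node counts per role, copied from the lists above.
# The data/region/supervisor/worker role is variable (at least 1 node).
FIXED_NODES = {
    "hadoop": {"head": 2},
    "hbase": {"head": 2, "zookeeper": 3},
    "storm": {"nimbus": 2, "zookeeper": 3},
    "spark": {"head": 2, "zookeeper": 3},
}


def total_nodes(cluster_type, workers, r_server=False):
    """Return the total node count for a cluster with `workers` scalable nodes."""
    if workers < 1:
        raise ValueError("each cluster type needs at least 1 worker-type node")
    fixed = sum(FIXED_NODES[cluster_type].values())
    edge = 1 if r_server else 0  # R-Server adds one edge node
    return fixed + workers + edge


if __name__ == "__main__":
    print(total_nodes("hadoop", workers=3))                # 2 head + 3 data
    print(total_nodes("spark", workers=4, r_server=True))  # 2 + 3 + 4 + 1 edge
```

Comparing the result against your subscription's node limit shows whether a limit increase is needed before deployment.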
If you run a cluster for 100 hours in US East with two D13 v2 head nodes, three D12 v2 data nodes, and three D11 v2 zookeepers, the billing would be the following in the two scenarios:
- On a Standard HDInsight cluster—100 hours x (2 x $-/hour + 3 x $-/hour + 3 x $-/hour) = $-
- On a Standard HDInsight cluster with Enterprise Security Package—100 hours x (2 x $-/hour + 3 x $-/hour + 3 x $-/hour) + 100 hours x (2 x 8 + 3 x 4 + 3 x 2) x $-/core-hour = $-
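The two formulas above can be sketched as a small function. This is an illustration only: `cluster_cost` is a hypothetical helper, and the hourly and per-core rates below are made-up placeholders, not actual Azure prices (use the Azure Calculator for real figures). The core counts (8, 4, and 2) come from the Enterprise Security Package formula above.

```python
def cluster_cost(hours, nodes, esp_rate_per_core_hour=None):
    """Estimate cluster cost.

    nodes: list of (count, rate_per_node_hour, cores_per_node) tuples.
    esp_rate_per_core_hour: if given, adds the Enterprise Security Package
    surcharge, billed per core-hour across all nodes.
    """
    base = hours * sum(count * rate for count, rate, _ in nodes)
    if esp_rate_per_core_hour is None:
        return base
    total_cores = sum(count * cores for count, _, cores in nodes)
    return base + hours * total_cores * esp_rate_per_core_hour


if __name__ == "__main__":
    # Placeholder rates (NOT real prices): head, data, and zookeeper nodes.
    nodes = [
        (2, 1.00, 8),  # two D13 v2 head nodes, 8 cores each
        (3, 0.50, 4),  # three D12 v2 data nodes, 4 cores each
        (3, 0.25, 2),  # three D11 v2 zookeepers, 2 cores each
    ]
    print(cluster_cost(100, nodes))                                # Standard
    print(cluster_cost(100, nodes, esp_rate_per_core_hour=0.125))  # with ESP
```

Because billing is per minute, fractional hours (for example, 90 minutes as 1.5) can be passed as `hours`.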
In order to stop an HDInsight cluster, you must delete the cluster. By default, all data that an HDInsight cluster operates on resides in Azure Blob storage, so data will not be affected by this. If you want to preserve your Hive metadata (tables, schemas) you should provision a cluster with an external metadata store. You can find more details in this documentation.