HDInsight set a firm goal of helping enterprises build secure, robust, scalable open source streaming pipelines on Azure. To meet this goal, a few months ago we announced a limited preview of Managed Kafka on Azure HDInsight. The addition of Kafka on HDInsight completes the ingestion piece for scalable open source streaming on Azure. In addition to the scale and performance benefits of Apache Kafka, HDInsight Kafka customers reap the following advantages:
- The promise of a managed open source Kafka backed by a 99.9% uptime SLA
- The managed service covers installation, configuration, and management of open source components
- HDInsight additionally provisions and monitors a Zookeeper quorum as part of the cluster shape.
- Managed rebalance of replicas and partitions across Azure update domains and fault domains. This ensures high availability of Kafka partitions on environments with a multidimensional view of a rack. This tool is also open sourced here.
- Security and compliance benefits of Azure and HDInsight with certifications such as SOC and PCI DSS.
- An integrated experience to deploy a managed and secure streaming pipeline (Kafka, Storm or Spark streaming) within minutes via prebuilt architectures on ARM templates.
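To make the replica rebalancing idea above concrete, here is a minimal sketch of domain-aware placement. It is not the algorithm used by the HDInsight rebalance tool; it is a simple greedy illustration. The `Broker` tuple, the `place_replicas` function, and the example layout of 3 fault domains and 5 update domains are all assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical broker description: an id plus the Azure fault domain (FD)
# and update domain (UD) the underlying VM lives in.
Broker = namedtuple("Broker", ["id", "fault_domain", "update_domain"])


def place_replicas(brokers, num_partitions, replication_factor):
    """Return {partition: [broker ids]}, spreading replicas across FDs and UDs."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds broker count")
    assignment = {}
    for partition in range(num_partitions):
        # Rotate the starting broker so first replicas are spread evenly.
        start = partition % len(brokers)
        ordered = brokers[start:] + brokers[:start]
        chosen = []
        # First pass: only accept brokers that share neither an FD nor a UD
        # with a replica already chosen for this partition.
        for broker in ordered:
            if len(chosen) == replication_factor:
                break
            clash = any(
                broker.fault_domain == c.fault_domain
                or broker.update_domain == c.update_domain
                for c in chosen
            )
            if not clash:
                chosen.append(broker)
        # Fallback: if distinct domains run out, fill from the remaining brokers.
        for broker in ordered:
            if len(chosen) == replication_factor:
                break
            if broker not in chosen:
                chosen.append(broker)
        assignment[partition] = [b.id for b in chosen]
    return assignment


# Example layout (assumed): 6 brokers over 3 fault domains and 5 update domains.
brokers = [Broker(i, i % 3, i % 5) for i in range(6)]
assignment = place_replicas(brokers, num_partitions=6, replication_factor=3)
for partition, replicas in assignment.items():
    print(partition, replicas)
```

With this layout, every partition ends up with replicas in three different fault domains and three different update domains, so neither a single hardware failure nor a single platform update takes all copies of a partition offline at once.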
Today we are pleased to announce the Public Preview of Apache Kafka with Azure Managed Disks on the HDInsight platform. Users can now deploy Kafka clusters with managed disks straight from the Azure portal, with no signup necessary. This brings exponentially higher scalability, alongside exponentially lower cost as workloads scale. This feature is discussed in more detail below.
Customer Success Stories: Toyota Connected
Over the last year, HDInsight has worked very closely with Toyota Inc. to build one of the world’s largest and most distributed connected car streaming platforms. This platform processes millions of large events per day in production on HDInsight Kafka to unlock insights in real time. A platform at this scale was made possible by the secure, managed, and elastically scalable nature of HDInsight. The benefits are best explained by the Chief Product Owner of Toyota Connected below.
“Toyota manufactures millions of cars running globally, and building a connected car platform to process real-time data at Toyota scale is a monumental challenge. To process events at Toyota’s scale, technologies such as Kafka need to be leveraged. Since HDInsight is the only managed platform that provides Kafka as a managed service with a 99.9% SLA, Toyota was able to leverage the scalable technology of Kafka, Storm and Spark on Azure HDInsight. Using the HDInsight platform, we were able to deploy enterprise grade streaming pipelines to process events from millions of cars every second. This is just scratching the surface – the future of global connected cars on Azure HDInsight is bright, and we are excited for what’s in store.” -Vijay Chemuturi, Chief Product Owner, Toyota Connected
A high-level view of the connected car architecture is depicted below. As Vijay states, this is just the beginning, and we are very excited to build upon this powerful streaming platform in the upcoming months.
More details on architecting similar IoT scenarios will follow in an upcoming series of blog posts.
Integration of HDInsight Kafka with Azure Managed Disks
With this public preview, HDInsight Kafka is also releasing native integration with Azure Managed Disks.
Azure Managed Disks is a new feature that abstracts the storage account specification away from the customer, allowing for an easier, managed route to using disks. It provides higher scale by removing the storage account IOPS limitation, along with the ability to create hundreds of VMs from a given VHD in a centralized storage account. A disk can be either Premium (SSD) or Standard (HDD), and is 1 TB in size. More information on this feature is located here.
Kafka is a high-throughput, low-latency messaging service that is I/O heavy. Prior to Azure Managed Disks, HDInsight Kafka’s original preview offering stored data on the largest persisted disk of the node. This meant that each node was limited to 1 TB. Given Kafka’s I/O-heavy nature, the disk would often become the bottleneck, and additional nodes had to be added for more storage. This resulted in high cost, with a gross underutilization of the CPU and memory on the cluster. With this release, we are implementing the HDInsight Kafka with Managed Disks feature, which is depicted below.
With this feature, data can be both persisted and scalable, up to 16 TB per node. This allows for an exponentially lower cost, higher scalability, and better performance as the workloads increase. Since the cost of a disk is a fraction of the cost of a node, the figure below shows how the number of nodes and the cost scale down exponentially as scaling needs increase.
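The storage arithmetic behind this claim can be sketched as follows. This is a rough illustration that sizes a cluster on storage alone. It ignores replication, throughput, and any minimum node count. The `nodes_needed` helper is hypothetical, and the limit of 16 disks per node is inferred from the 1 TB disk size and the 16 TB per-node figure above.

```python
import math

DISK_SIZE_TB = 1          # each managed disk is 1 TB, per the post
MAX_DISKS_PER_NODE = 16   # inferred from the 16 TB per-node limit


def nodes_needed(storage_tb, disks_per_node=1):
    """Worker nodes required to hold storage_tb, sized on storage alone."""
    if not 1 <= disks_per_node <= MAX_DISKS_PER_NODE:
        raise ValueError("disks_per_node must be between 1 and 16")
    per_node_tb = disks_per_node * DISK_SIZE_TB
    return math.ceil(storage_tb / per_node_tb)


# Compare the original preview (1 TB per node) with 16 managed disks per node.
for storage_tb in (16, 64, 256):
    before = nodes_needed(storage_tb, disks_per_node=1)
    after = nodes_needed(storage_tb, disks_per_node=16)
    print(f"{storage_tb} TB: {before} nodes at 1 TB/node, {after} nodes at 16 TB/node")
```

Because a disk costs a fraction of what a node costs, replacing most of those nodes with attached disks is where the savings come from.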
This feature is automatically turned on, and taking advantage of it is simple: the user just needs to specify the number of disks to be attached to a given node. This can be done via the portal, or by specifying a single property in the ARM template, as shown in the figures below. Note that the type of disk, Premium or Standard, is determined by the type of VMs chosen for the worker nodes. Premium disks are attached to DS and GS series VMs, whereas Standard disks are attached to all other VM types. End-to-end templates, with examples of how to create these clusters, are detailed in the next section. More information on this is located in our documentation.
Figure: Disk Specification via Portal
Figure: Disk Specification via ARM template
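As a rough sketch of what the disk specification might look like, the snippet below builds a worker node role fragment and prints it as JSON. The property names `dataDisksGroups` and `disksPerNode` are our best recollection of the HDInsight template schema and should be checked against the documentation. The helper names `disk_tier_for_vm` and `worker_node_role` are hypothetical, and the fragment is not a complete template. The disk tier rule is the one described above: Premium for DS and GS series VMs, Standard for everything else.

```python
import json


def disk_tier_for_vm(vm_size):
    """Apply the rule from the post: DS and GS series get Premium disks."""
    # "Standard_DS3_v2" -> "DS3"; the leading "Standard" is the VM tier name,
    # not the disk type.
    parts = vm_size.split("_")
    family = parts[1].upper() if len(parts) > 1 else vm_size.upper()
    return "Premium" if family.startswith(("DS", "GS")) else "Standard"


def worker_node_role(vm_size, instance_count, disks_per_node):
    """Build an illustrative worker node role fragment (property names assumed)."""
    if not 1 <= disks_per_node <= 16:
        raise ValueError("disks_per_node must be between 1 and 16")
    return {
        "name": "workernode",
        "targetInstanceCount": instance_count,
        "hardwareProfile": {"vmSize": vm_size},
        # Assumed name of the single property that sets the disk count.
        "dataDisksGroups": [{"disksPerNode": disks_per_node}],
    }


role = worker_node_role("Standard_DS3_v2", instance_count=4, disks_per_node=8)
print(json.dumps(role, indent=2))
print("Disk tier:", disk_tier_for_vm("Standard_DS3_v2"))
```

Note that the disk tier does not appear in the fragment at all; it follows from the VM size chosen for the worker nodes.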
Start deploying and using Spark, Storm, and Kafka with Managed Disks on HDInsight within minutes
We have updated our documentation and samples to help deploy scalable open source streaming solutions on HDInsight. Each of these examples walks through creating the clusters step by step, and contains one-click deploy ARM templates to enable powerful pipelines. We have additionally updated the Spark Streaming examples to include new examples for Structured Streaming and for creating an end-to-end pipeline using Twitter, Kafka, and Spark Streaming.
- Getting Started with Kafka for HDInsight
- Deploy HDInsight Kafka + Spark streaming
- Deploy HDInsight Kafka + Storm
- Stream data from on-premise to HDInsight Kafka in the cloud
- Stream tweets to HDInsight Kafka and process with Spark structured streaming
For any questions, suggestions or feedback, please do not hesitate to reach out to us via HDIFeedback@microsoft.com. We are really excited to have you onboard, and would love to hear from you.