
Apache Kafka for HDInsight

Managed high-throughput, low-latency service for real-time data

Kafka for HDInsight is an enterprise-grade, open-source, streaming ingestion service that’s cost-effective and easy to set up, manage and use. Build real-time solutions such as Internet of Things (IoT), fraud detection, clickstream analysis, financial alerts and social analytics.

Managed Kafka with a 99.9% SLA

Purchasing hardware, then installing and tuning Kafka on it, takes a lot of time and effort. Keeping those machines up and running so that no data is lost is an even greater challenge, with a high cost of ownership. Kafka for Azure HDInsight manages all of this for you. With four clicks, Kafka clusters are up and running within minutes, with a 99.9% SLA on Kafka uptime. This means that you can concentrate on writing real-time applications and building higher-level pipelines instead of worrying about installing new Kafka brokers or fixing broken ones.

Rack awareness for Azure Environments

Kafka was designed with a single-dimensional view of a rack, which works well in some environments. In environments such as Azure, however, a rack is separated into two dimensions: Update Domains (UDs) and Fault Domains (FDs). HDInsight Kafka provides scalable and robust tools to ensure that Kafka is rack-aware in Azure environments. These tools rebalance partitions and replicas across UDs and FDs for the highest level of Kafka availability across Azure Availability Zones.
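To illustrate the idea of two-dimensional rack awareness, here is a minimal sketch of placing one partition's replicas so that no two share a Fault Domain or an Update Domain where possible. It is not the HDInsight rebalancing tool: the `Broker` type, the `pick_replicas` function and the six-broker layout are all hypothetical, and a simple greedy pass stands in for whatever the real tools do.

```python
from typing import List, NamedTuple


class Broker(NamedTuple):
    # Hypothetical description of a broker's position in Azure.
    broker_id: int
    fault_domain: int
    update_domain: int


def pick_replicas(brokers: List[Broker], replication_factor: int,
                  start: int = 0) -> List[int]:
    """Greedily pick brokers whose FD and UD both differ from those already chosen."""
    n = len(brokers)
    # Rotate the starting broker so different partitions can begin at different brokers.
    order = [brokers[(start + i) % n] for i in range(n)]
    chosen: List[Broker] = []
    for b in order:
        if len(chosen) == replication_factor:
            break
        if all(b.fault_domain != c.fault_domain and
               b.update_domain != c.update_domain for c in chosen):
            chosen.append(b)
    # Fallback: if the domains cannot all differ, fill up with any unused brokers.
    for b in order:
        if len(chosen) == replication_factor:
            break
        if b not in chosen:
            chosen.append(b)
    return [b.broker_id for b in chosen]


# Assumed layout: six brokers spread over three FDs and three UDs.
brokers = [Broker(i, i % 3, i // 2) for i in range(6)]
print(pick_replicas(brokers, 3))  # [0, 2, 4]
```

In this assumed layout, brokers 0, 2 and 4 sit in three different FDs and three different UDs, so neither a single hardware fault nor a single update wave takes out more than one replica.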

Integration with Azure Managed Disks

Because Kafka workloads are ingestion-heavy, the disks attached to the cluster nodes often become a bottleneck. Traditionally, relieving this bottleneck meant adding more nodes. Azure Managed Disks is a technology that provides cheaper, scalable disks at a fraction of the cost of a node. HDInsight Kafka has integrated with these disks to provide up to 16 TB per node instead of the traditional 1 TB. This gives up to 16 times the storage per node, so far fewer nodes are needed for the same volume of data and costs fall accordingly. Our enterprise customers have been able to save thousands of pounds per month due to this innovation.
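The effect of the 16 TB versus 1 TB per-node figures can be shown with a small calculation. This is only a sketch that assumes storage is the sole sizing constraint; the `nodes_needed` helper and the 48 TB retained-data figure are hypothetical.

```python
import math


def nodes_needed(total_tb: float, tb_per_node: float) -> int:
    """Nodes required when disk capacity is the only constraint (an assumption)."""
    return math.ceil(total_tb / tb_per_node)


retained_tb = 48  # hypothetical volume of retained data, replicas included

print(nodes_needed(retained_tb, 1))   # 48 nodes at 1 TB per node
print(nodes_needed(retained_tb, 16))  # 3 nodes at 16 TB per node
```

In practice, CPU, network and replication requirements also set a lower bound on node count, so the real saving depends on the workload.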

Out-of-the-box alerting, monitoring and predictive maintenance

Getting a streaming pipeline up and running is just the start. Ensuring that it performs reliably requires huge investments in monitoring and alerting infrastructure. Kafka for HDInsight takes away this problem, as it is integrated with Azure’s monitoring suite out of the box. This technology allows you to monitor everything from VM-level disk and NIC metrics to JMX metrics from Kafka, Storm and Spark. Not only can you create powerful alerting and monitoring dashboards, you can also specify scripts and runbooks against these metrics for automated and predictive maintenance of your streaming pipeline.

MirrorMaker support for replicating Kafka data

Kafka is often deployed in multiple environments for disaster recovery, high availability and on-premises-to-cloud hybrid scenarios. These require replication of data from one Kafka cluster to another. HDInsight has worked closely with enterprise customers to understand this need, and provides support for data replication scenarios. Mirroring on HDInsight Kafka is easy to set up and use.

Cluster scaling within minutes

Estimates for message sizes, messages per second and streaming needs change as the pipeline is used. Traditionally, a cluster is sized for peak traffic, which results in very high costs for unused capacity. When the time comes to add more nodes, the new machines need to be provisioned, installed and configured, with customisations reapplied. On HDInsight Kafka, you can start with small clusters and scale them up as needed, which significantly lowers costs. HDInsight takes care of provisioning the new nodes and applies your customisations within minutes.
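The difference between sizing for peak traffic and sizing for current traffic can be estimated roughly as follows. This is a back-of-the-envelope sketch, not an HDInsight sizing tool: the `brokers_for_throughput` helper, the 50 MB/s per-broker write capacity and the traffic figures are all assumptions chosen for illustration.

```python
import math


def brokers_for_throughput(msgs_per_sec: int, avg_msg_bytes: int,
                           replication_factor: int,
                           per_broker_mb_per_sec: int) -> int:
    """Estimate broker count from write throughput alone (an assumption)."""
    ingress_mb = msgs_per_sec * avg_msg_bytes / 1_000_000
    # Every message is written once per replica.
    total_write_mb = ingress_mb * replication_factor
    return math.ceil(total_write_mb / per_broker_mb_per_sec)


# Hypothetical workload: 1 KB messages, replication factor 3,
# and an assumed 50 MB/s of sustained writes per broker.
current = brokers_for_throughput(50_000, 1_000, 3, 50)
peak = brokers_for_throughput(200_000, 1_000, 3, 50)

print(current)  # 3 brokers for today's traffic
print(peak)     # 12 brokers if sized for the peak up front
```

Under these assumptions, starting at the current load and scaling towards the peak only when needed avoids paying for nine idle brokers in the meantime.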

What can you build with Kafka for HDInsight?

Learn about use cases below:

Data comes in from various event sources (applications, devices, sensors, web, social) and is collected in the cloud through web APIs or field gateways. The data stream is ingested by Kafka for HDInsight for processing and analytics with services such as Azure Machine Learning, Spark for HDInsight, Storm for HDInsight and storage adapters. The data then moves to long-term storage with services such as Apache HBase on HDInsight, DocumentDB, MongoDB, SQL, Solr, Azure Data Lake Store and Azure Search. You can then run your dashboards, queries and analytics in real time, or send data to devices to take action.

Customers using Kafka for HDInsight

  • Office 365
  • Toyota
  • Bing Ads
Toyota Connected

"Toyota manufactures millions of cars running globally, and building a connected car platform to process real-time data at Toyota scale is a monumental challenge. To process events at Toyota's scale, technologies such as Kafka need to be leveraged. Since HDInsight is the only managed platform that provides Kafka as a managed service with a 99.9% SLA, Toyota was able to leverage the scalable technology of Kafka, Storm and Spark on Azure HDInsight. Using the HDInsight platform, we were able to deploy enterprise grade streaming pipelines to process events from millions of cars every second. This is just scratching the surface - the future of global connected cars on Azure HDInsight is bright, and we are excited for what's in store."

Vijay Chemuturi, Chief Product Owner, Toyota Connected

New to Kafka for HDInsight?

Use the links below to create robust, enterprise-ready streaming pipelines using Kafka, Storm and Spark Streaming on Azure.

Monitor real-time streaming pipelines with Azure

Learn how to use HDInsight Kafka’s integration with Azure Monitoring to create powerful alerting and monitoring dashboards, and automated scripts and runbooks, for predictive maintenance of your streaming pipeline.

Try Kafka for HDInsight