Apache Kafka for HDInsight

Managed high-throughput, low-latency service for real-time data

Kafka for HDInsight is an enterprise-grade, open-source, streaming ingestion service that’s cost-effective and easy to set up, manage, and use. Build real-time solutions such as Internet of Things (IoT), fraud detection, clickstream analysis, financial alerts, and social analytics.

Managed Kafka with a 99.9% SLA

Purchasing hardware and installing and tuning the software takes a lot of time and effort. Keeping those machines up and running so that no data is lost is an even greater challenge and carries a huge cost of ownership. Kafka for Azure HDInsight manages all of this for you. In four clicks, Kafka clusters are up and running within minutes, with a 99.9% SLA on Kafka uptime. This means you can concentrate on writing real-time applications and their logic and on building higher-level pipelines, instead of worrying about installing new Kafka brokers or fixing broken ones.

Rack awareness for Azure Environments

Kafka was designed with a one-dimensional view of a rack, which works well in some environments. In environments such as Azure, however, a rack is separated into two dimensions: Update Domains (UDs) and Fault Domains (FDs). HDInsight Kafka has developed scalable and robust tools to ensure that Kafka is rack-aware in Azure environments. These tools rebalance partitions and replicas across UDs and FDs for the highest levels of Kafka availability across Azure Availability Zones.
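To make the two-dimensional idea concrete, here is a minimal Python sketch of placing each partition's replicas so that no two of them share a Fault Domain or an Update Domain. It is not the HDInsight rebalancing tool: the `Broker` record, the `assign_replicas` function, and the 3 x 3 broker layout are hypothetical and exist only for illustration.

```python
from collections import namedtuple

# Hypothetical broker record: an id plus the two Azure "rack" dimensions.
Broker = namedtuple("Broker", ["id", "fault_domain", "update_domain"])


def assign_replicas(brokers, num_partitions, replication_factor):
    """Greedy sketch: pick replicas that differ in both FD and UD when possible."""
    assignment = {}
    n = len(brokers)
    for partition in range(num_partitions):
        # Rotate the starting broker so first replicas spread across the cluster.
        candidates = [brokers[(partition + i) % n] for i in range(n)]
        chosen = []
        # First pass is strict (no shared FD or UD); second pass fills any gaps.
        for strict in (True, False):
            for broker in candidates:
                if len(chosen) == replication_factor:
                    break
                if broker in chosen:
                    continue
                clash = any(
                    broker.fault_domain == c.fault_domain
                    or broker.update_domain == c.update_domain
                    for c in chosen
                )
                if strict and clash:
                    continue
                chosen.append(broker)
        assignment[partition] = [b.id for b in chosen]
    return assignment


# Assumed layout: 9 brokers spread over 3 Fault Domains and 3 Update Domains.
brokers = [Broker(i, i % 3, i // 3) for i in range(9)]
plan = assign_replicas(brokers, num_partitions=9, replication_factor=3)
```

With this layout, every partition in `plan` ends up with three replicas in three different FDs and three different UDs, so neither a single hardware fault nor a single rolling update takes out more than one replica.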

Integration with Azure Managed Disks

Because Kafka workloads are ingestion-heavy, the disks attached to the cluster nodes often become the bottleneck. Traditionally, relieving this bottleneck means adding more nodes. Azure Managed Disks is a technology that provides cheaper, scalable disks at a fraction of the cost of a node. HDInsight Kafka integrates with these disks to provide up to 16 TB per node instead of the traditional 1 TB. The same cluster can therefore hold up to 16 times as much data, and a given storage requirement can be met with far fewer nodes and at lower cost. Our enterprise customers have been able to save thousands of dollars per month due to this innovation.
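The effect of per-node capacity on cluster size is simple arithmetic. The sketch below assumes storage is the only constraint and uses a hypothetical 64 TB requirement; the helper name `nodes_for_storage` is illustrative only.

```python
import math


def nodes_for_storage(required_tb, tb_per_node):
    """Nodes needed when disk capacity is the only sizing constraint (an assumption)."""
    return math.ceil(required_tb / tb_per_node)


required_tb = 64  # hypothetical amount of data to keep on the cluster
traditional = nodes_for_storage(required_tb, tb_per_node=1)     # 1 TB per node
with_managed = nodes_for_storage(required_tb, tb_per_node=16)   # up to 16 TB per node
print(f"{traditional} nodes at 1 TB/node, {with_managed} nodes at 16 TB/node")
```

Real clusters are also bounded by CPU, network, and replication, so treat the result as a lower bound on node count rather than a sizing recommendation.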

Out of the box alerting, monitoring and predictive maintenance

Getting a streaming pipeline up and running is just the start. Ensuring that it performs reliably requires huge investments in monitoring and alerting infrastructure. Kafka for HDInsight takes away this problem because it is integrated with Azure’s monitoring suite out of the box. This technology allows you to monitor everything from VM-level disk and NIC metrics to JMX metrics from Kafka, Storm, and Spark. Not only can you create powerful alerting and monitoring dashboards, you can also run scripts and runbooks against these metrics for automated and predictive maintenance of your streaming pipeline.

MirrorMaker support for replicating Kafka data

Kafka is often deployed in multiple environments for disaster recovery, high availability, and on-premises-to-cloud hybrid scenarios. These require replicating data from one Kafka cluster to another. HDInsight has worked closely with enterprise customers to understand this need and provides support for data replication scenarios. Mirroring on HDInsight Kafka is easy to set up and use.

Cluster scaling within minutes

Estimates for message size, messages per second, and other streaming needs change as the pipeline is used. Traditionally, a cluster is sized for peak traffic, which results in very high costs for unused capacity. When the time comes to add more nodes, the new machines need to be provisioned, installed, and configured, with customizations reapplied. On HDInsight Kafka, you can start with a small cluster and scale it up as needed, so you pay only for the capacity you use. HDInsight takes care of provisioning the new nodes, with your customizations applied, within minutes.
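As a rough illustration of why sizing for peak traffic is expensive, the sketch below turns a message-size and message-rate estimate into a broker count. The function name, the 50 MB/s per-broker write capacity, the replication factor of 3, and the traffic figures are all assumptions chosen for the example, not HDInsight guidance.

```python
import math


def estimate_brokers(msg_bytes, msgs_per_sec, broker_mb_per_sec=50, replication_factor=3):
    """Rough broker count from write throughput; all defaults are illustrative assumptions."""
    ingest_mb_per_sec = msg_bytes * msgs_per_sec / 1_000_000
    # Every message is written once per replica.
    total_write_mb_per_sec = ingest_mb_per_sec * replication_factor
    needed = math.ceil(total_write_mb_per_sec / broker_mb_per_sec)
    # A topic cannot have more replicas than there are brokers.
    return max(needed, replication_factor)


typical = estimate_brokers(msg_bytes=1_000, msgs_per_sec=50_000)   # everyday load
peak = estimate_brokers(msg_bytes=1_000, msgs_per_sec=500_000)     # rare peak load
print(f"typical load: {typical} brokers, peak load: {peak} brokers")
```

Under these assumptions, a cluster sized permanently for the peak would run ten times as many brokers as the typical load needs, which is the unused capacity that starting small and scaling on demand avoids.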

What can you build with Kafka for HDInsight?

Learn about use cases below:

Data comes in from various event sources (applications, devices, sensors, web, social) and is collected in the cloud through web APIs or field gateways. The data stream is ingested by Kafka for HDInsight for processing and analytics with services like Azure Machine Learning, Spark for HDInsight, Storm for HDInsight, and storage adapters. The data moves to long-term storage with services like Apache HBase on HDInsight, DocumentDB, MongoDB, SQL, Solr, Azure Data Lake Store, and Azure Search. Then you can run your real-time dashboards, queries, and analytics, or send data to devices to take action.

Customers using Kafka for HDInsight

  • Office 365
  • Toyota
  • Bing Ads
Toyota Connected

"Toyota manufactures millions of cars running globally, and building a connected car platform to process real-time data at Toyota scale is a monumental challenge. To process events at Toyota's scale, technologies such as Kafka need to be leveraged. Since HDInsight is the only managed platform that provides Kafka as a managed service with a 99.9% SLA, Toyota was able to leverage the scalable technology of Kafka, Storm and Spark on Azure HDInsight. Using the HDInsight platform, we were able to deploy enterprise grade streaming pipelines to process events from millions of cars every second. This is just scratching the surface - the future of global connected cars on Azure HDInsight is bright, and we are excited for what's in store."

Vijay Chemuturi, Chief Product Owner, Toyota Connected

New to Kafka for HDInsight?

Use the links below to create robust, enterprise-ready streaming pipelines using Kafka, Storm, and Spark Streaming on Azure.

Monitor real-time streaming pipelines with Azure

Learn how to use HDInsight Kafka's integration with Azure Monitoring to create powerful alerting and monitoring dashboards, and to run automated scripts and runbooks for predictive maintenance of your streaming pipeline.

Try Kafka for HDInsight