Apache Kafka on Azure HDInsight was added last year as a preview service to help enterprises create real-time big data pipelines. Since then, large companies such as Toyota, Adobe, Bing Ads, and GE have been using this service in production to process over a million events per second, powering scenarios such as connected cars, fraud detection, clickstream analysis, and log analytics. The HDInsight team has worked closely with these customers to understand the challenges of running a robust, real-time production pipeline at enterprise scale. Using what we learned, we have implemented key features in the managed Kafka service on HDInsight, which is now generally available.
A fully managed Kafka service for the enterprise use case
Running big data streaming pipelines is hard. Doing so with open source technologies for the enterprise is even harder. Apache Kafka, a key open source technology, has emerged as the de facto standard for ingesting large volumes of streaming events in a scalable, low-latency, and low-cost fashion. Enterprises want to leverage this technology, but installing, managing, and maintaining a streaming pipeline poses many challenges. Open source software often comes without dedicated support, so in-house staff must be well versed in these technologies to ensure the highest levels of uptime. Every second an ingestion pipeline is down, data is lost.
The HDInsight team learned from the challenges that enterprises faced while installing and operating Apache Kafka, and introduced Apache Kafka on HDInsight as a managed service last November. HDInsight is a managed platform with a 99.9% SLA on open source workloads. With this addition, our enterprise customers no longer need to worry about managing Kafka clusters, as HDInsight manages and fixes the issues involved with running Kafka at enterprise scale. Not only did we onboard Apache Kafka as a fully managed service on Azure HDInsight, we also used our customers’ feedback to build key features into the managed service.
We introduced native integration with Azure Managed Disks, which substantially reduces costs as these workloads are scaled out for large enterprises like Toyota and Bing Ads. We also introduced tools for implementing rack awareness in Apache Kafka for the Azure environment to ensure the highest levels of Kafka availability on HDInsight. These key features and the general availability of Apache Kafka on HDInsight complete an end-to-end streaming pipeline on the Azure platform. Enterprises can deploy highly scalable, fault-tolerant, and secure real-time architectures with Apache Kafka, Apache Spark, and Apache Storm on the managed HDInsight platform with a single click.
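To make the idea of rack awareness concrete, here is a minimal Python sketch of what it aims for: the replicas of each partition land on brokers in different racks, so losing one rack does not take out every copy. This is an illustration only. It is not Kafka's actual assignment algorithm or the HDInsight tooling, and it assumes that each rack corresponds to an Azure fault domain and that every rack holds the same number of brokers. The function name `assign_replicas`, the rack labels `fd0` to `fd2`, and the broker IDs are all hypothetical.

```python
def assign_replicas(brokers_by_rack, num_partitions, replication_factor):
    """Return {partition: [broker ids]} with replicas spread across racks.

    Illustrative sketch only; assumes equal-sized racks and a replication
    factor no larger than the number of racks.
    """
    racks = sorted(brokers_by_rack)
    if replication_factor > len(racks):
        raise ValueError("replication factor exceeds the number of racks")
    if len({len(brokers_by_rack[r]) for r in racks}) != 1:
        raise ValueError("this sketch assumes equal-sized racks")

    # Interleave brokers rack by rack, so that any run of consecutive
    # entries (up to the number of racks) sits on distinct racks.
    per_rack = len(brokers_by_rack[racks[0]])
    interleaved = [brokers_by_rack[r][i] for i in range(per_rack) for r in racks]

    # Each partition takes a sliding window over the interleaved list.
    n = len(interleaved)
    return {
        p: [interleaved[(p + k) % n] for k in range(replication_factor)]
        for p in range(num_partitions)
    }


# Hypothetical cluster: three fault domains with two brokers each.
brokers_by_rack = {"fd0": [0, 1], "fd1": [2, 3], "fd2": [4, 5]}
assignment = assign_replicas(brokers_by_rack, num_partitions=6, replication_factor=3)

# Reverse lookup from broker id to rack, used to check the placement.
rack_of = {b: rack for rack, ids in brokers_by_rack.items() for b in ids}
for partition, replicas in assignment.items():
    print(partition, replicas, [rack_of[b] for b in replicas])
```

With three fault domains and a replication factor of three, every partition in this sketch keeps one replica in each fault domain, which is the availability property that rack awareness is meant to provide.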
Customer success stories
The preview launch of managed Kafka on HDInsight received an overwhelming response. The Kafka team at HDInsight collaborated with large enterprises to help enable their streaming big data scenarios encompassing connected cars, clickstream analytics, fraud detection, real-time patient care, and more. In addition to getting these scenarios off the ground, managed Kafka on HDInsight serves as the backbone that powers them in production today, processing upwards of a trillion events per day. Along with the June release that added managed disks, we showed how millions of Toyota cars send data to Azure every second through the managed Kafka service on HDInsight. Today we have a few more scenarios to showcase.
Adobe Experience Cloud
Adobe's Experience Cloud provides industry-leading analytics and data processing for the world's top firms. In early 2017, Adobe's team needed a new way to ingest massive amounts of data for some of their most demanding customers. Not only did Adobe need to develop an entirely new way of accepting this data, but they also needed to build, test, and ship this new product feature in just a few months' time. Adobe decided to use HDInsight Kafka to help them build this new pipeline and has been successfully using it to process over a billion transactions each day.
"Azure's HDInsight Kafka was the perfect solution for us. We had used Kafka successfully in other projects, but it would have taken us too long to get an internal Kafka instance deployed and ready to use. Using Azure, we had a development Kafka cluster ready in hours that helped us build our new system in record time. When we were ready to start ingesting live data with our new system we decided to continue using HDInsight Kafka and it has been a solid part of our infrastructure for months."
– Josh Butikofer, Sr. Software Architect, Adobe
GE Healthcare digital health revolution
“At GE Healthcare, we apply cutting edge technological innovations in cloud big data and machine learning to solve problems faced by thousands of clinics, hospitals, health care providers and millions of patients every day. We use Apache Kafka as a key technology we use to power these intelligent scenarios. Azure HDInsight provides Apache Kafka and Apache Spark as managed services, which makes it very easy for us to manage and operate these services in the Azure cloud at GE’s scale. This is just the start – we hope to bring this scale of analytics and intelligence to drive productive transformations in the digital health revolution.”
– Animesh Mahapatra, Director Software Engineering, GE
Microsoft Office365, Skype, and Bing Ads
“Data is the backbone of Microsoft's massive scale cloud services such as Bing, Office365, and Skype. Siphon is a service that provides a highly available and reliable distributed Data Bus for ingesting, distributing, and consuming near real-time data streams for processing and analytics for these services. For Siphon, we rely on Azure HDInsight Kafka as a core building block that is highly reliable, scalable, and cost effective. Siphon ingests over a trillion messages per day, and we look forward to leverage HDInsight Kafka to continue to grow in scale and throughput.”
– Thomas Alex, Principal Program Manager, Microsoft
Learn more and get started
Watch the Azure Friday video with Scott Hanselman and follow the learning paths listed below.
Apache Kafka on Azure HDInsight quick-start
- Deploy Kafka on Azure HDInsight with one click
- Configure high availability with rack awareness
- Use Kafka with Spark Structured Streaming on Azure HDInsight
- Learn how to use HDInsight Kafka's integration with Azure Monitoring
- Use MirrorMaker to replicate data between HDInsight Kafka and an on-premises or other Kafka instance
If you have any questions or feedback, please reach out to AskHDInsight@microsoft.com.