Use Apache Kafka with Apache Spark on HDInsight

This is a basic example of using Apache Spark on HDInsight to stream data from Apache Kafka to Azure Cosmos DB. It uses Spark Structured Streaming and the Azure Cosmos DB Spark connector.
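
As a rough sketch of the read side of that pipeline, Structured Streaming consumes the Kafka topic as shown below. This is only an illustration; the broker list is a placeholder, `spark` is the notebook's SparkSession, and the spark-sql-kafka package is assumed to be on the classpath. The actual code lives in the example notebooks.

    // Minimal sketch: read the tripdata topic from Kafka as a streaming DataFrame.
    // kafkaBrokers is a placeholder; the notebooks show how to look up the real broker list.
    val kafkaBrokers = "wn0-kafka:9092,wn1-kafka:9092"

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBrokers)
      .option("subscribe", "tripdata")
      .option("startingOffsets", "earliest")
      .load()

    // Kafka delivers the message payload as bytes; cast it to a string for further processing.
    val tripData = stream.selectExpr("CAST(value AS STRING) as value")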

This example requires Kafka and Spark on HDInsight 4.0 in the same Azure Virtual Network. It also requires an Azure Cosmos DB SQL API database.

NOTE: Apache Kafka and Apache Spark are available as two different HDInsight cluster types. HDInsight cluster types are tuned for the performance of a specific technology; in this case, Kafka and Spark. To use both together, you must create an Azure Virtual Network and then create a Kafka cluster and a Spark cluster on that virtual network.

The azuredeploy.json template file in this repo creates Kafka and Spark clusters in HDInsight, inside an Azure Virtual Network. It also creates an Azure Cosmos DB account.

IMPORTANT: For a more detailed walkthrough of using this project, see the Use Spark Structured Streaming with Kafka and Azure Cosmos DB document.

Deploy Azure resources

To create an environment that can be used to run this example, use the Deploy to Azure button to deploy the following Azure resources:

  • A virtual network

  • Spark on HDInsight 4.0

  • Kafka on HDInsight 4.0

  • Azure Cosmos DB SQL API database

When the template loads, you must provide the following information:

  • Cosmos DB account name: The name of the Cosmos DB account that is created.

  • Base cluster name: This must be a unique alphanumeric value, and is used to generate the following resource names:

    • Virtual Network: basename-network

    • Spark on HDInsight: spark-basename

    • Kafka on HDInsight: kafka-basename

    • Azure Storage Account: basenamestore

  • Cluster login name: This is the name used when logging in to Jupyter notebooks on the Spark cluster.

  • Cluster login password: The password for the login account.

  • SSH user name: The SSH user account for the cluster.

  • SSH password: The SSH user password.

NOTE: SSH isn't used by this example, but you must still provide these values.

NOTE: This template creates an Azure Virtual Network with an IP address space of 10.0.0.0/16.

Understand this example

This example uses a Scala application in a Jupyter notebook. The code in the notebook relies on the following pieces of data:

  • Kafka brokers: The broker process runs on each worker node in the Kafka cluster. The list of brokers is required by the producer component, which writes data to Kafka.

  • Topic name: The name of the topic that data is written to and read from. The default name used in the example is tripdata.

  • Cosmos DB endpoint: The HTTPS endpoint used to connect to Cosmos DB.

  • Master key: The key used to authenticate requests to the endpoint.

  • Database: The name of the Cosmos DB SQL API database that data is written to.

  • Collection: The document collection within the database.

  • PreferredRegions: The preferred Azure regions to use when writing to the database.

NOTE: The notebooks contain instructions and links that explain how to obtain each of these values.
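
Continuing the read sketch from the introduction, the values listed above are passed to the Cosmos DB Spark connector as a configuration map when the stream is written out. The following is only a sketch that assumes the azure-cosmosdb-spark connector's streaming sink; every value is a placeholder, and option names can differ between connector versions.

    // Minimal sketch: write the streaming DataFrame (tripData, from the earlier sketch) to Cosmos DB.
    // Every value below is a placeholder for the corresponding item described above.
    val cosmosConfig = Map(
      "Endpoint"         -> "https://ACCOUNTNAME.documents.azure.com:443/",
      "Masterkey"        -> "YOUR_MASTER_KEY",
      "Database"         -> "YOUR_DATABASE",
      "Collection"       -> "YOUR_COLLECTION",
      "PreferredRegions" -> "East US",
      "Upsert"           -> "true")

    val query = tripData.writeStream
      .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")
      .outputMode("append")
      .options(cosmosConfig)
      .option("checkpointLocation", "/tmp/streamingcheckpoint")   // placeholder checkpoint path
      .start()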

WARNING: HDInsight clusters are billed hourly. To prevent unnecessary costs, delete the clusters once you are finished with this example.

To run this example

To use the example Jupyter notebooks, you must upload them to the Jupyter Notebook server on the Spark cluster. Use the following steps to upload the notebooks:

  1. In your web browser, use the following URL to connect to the Jupyter Notebook server on the Spark cluster. Replace CLUSTERNAME with the name of your Spark cluster.

     https://CLUSTERNAME.azurehdinsight.net/jupyter
    

    When prompted, enter the cluster login name (admin) and password that you used when creating the cluster.

  2. From the upper right side of the page, use the Upload button to upload the Stream-taxi-data-to-Kafka.ipynb file. Select the file in the file browser dialog and select Open.

  3. Find the Stream-taxi-data-to-Kafka.ipynb entry in the list of notebooks, and select the Upload button beside it.

  4. Once the file has uploaded, select the Stream-taxi-data-to-Kafka.ipynb entry to open the notebook. To load data into Kafka, follow the instructions in the notebook (a rough sketch of the producer logic appears after these steps).

  5. Repeat steps 1-3 to upload the Stream-data-from-Kafka-to-Cosmos-DB.ipynb notebook to the Jupyter Notebook server. Once the file has uploaded, select the entry to open the notebook. Follow the instructions in the notebook to read the data from Kafka and store it in Cosmos DB.
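
For reference, loading data into Kafka from Spark amounts to serializing each row as the Kafka message value and writing it to the topic. The sketch below assumes a DataFrame named tripDf that holds the taxi trip records and reuses the placeholder broker list from the earlier sketch; the actual loading logic is in Stream-taxi-data-to-Kafka.ipynb.

    // Minimal sketch of the producer side: write each row of a DataFrame (tripDf, assumed to
    // hold the trip records) to the tripdata topic as a JSON-encoded message value.
    import org.apache.spark.sql.functions.{struct, to_json}

    tripDf
      .select(to_json(struct("*")).alias("value"))       // serialize each row as a JSON string
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBrokers)   // same placeholder broker list as above
      .option("topic", "tripdata")
      .save()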