Debug Spark Code Running in Azure HDInsight from Your Desktop

Posted on 28 August, 2017

Principal Program Manager, Big Data Team

This month’s IntelliJ HDInsight Tools release delivers a remote debugging engine for Spark running in the Azure cloud. With the Azure Toolkit for IntelliJ, Spark users can now perform interactive remote debugging directly against code running in HDInsight.

Debugging big data applications is a longstanding pain point. The data-intensive, distributed, scalable computing environment in which big data apps run is inherently difficult to troubleshoot, and Spark is no exception. There is little tooling support for debugging such scenarios, which leaves developers with manual, brute-force approaches that are cumbersome and limited. Common approaches include local debugging against sample data, which restricts the data size; analysis of log files after the app has completed, which requires manual parsing of unwieldy logs; and use of a Spark shell for line-by-line execution, which does not support breakpoints.

Azure Toolkit for IntelliJ addresses these challenges by letting the debugger attach to Spark processes on HDInsight for direct remote debugging. Developers can connect to the HDInsight cluster at any time, use IntelliJ’s built-in debugging capabilities, and have log files collected automatically. The steps for this interactive remote debugging are the same ones developers already know from debugging one-box apps. Developers do not need to know the cluster’s configuration or where the logs are located.

To learn more, watch this demo of HDInsight Spark Remote Debugging.

Key customer benefits

  • Use IntelliJ to run and debug a Spark application remotely on an HDInsight cluster at any time via “Run > Edit Configurations”.
  • Use IntelliJ built-in debugging capabilities, such as conditional breakpoints, to quickly identify data-related errors. Developers can inspect variables, watch intermediate data, step through code, and finally edit the app and resume execution – all against Azure HDInsight clusters with production data.
  • Set a breakpoint for both driver and executor code. Debugging executor code lets developers detect data-related errors by viewing RDD intermediate values, tracking distributed task operations, and stepping through execution units.
  • Set a breakpoint in Spark external libraries, which lets developers step into Spark code and debug inside the Spark framework.
  • View both driver and executor code execution logs in the console panel (see the “Driver Tab” and “Executor Tab”).
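To make the driver/executor distinction above concrete, here is a minimal sketch in plain Java. It is not taken from the toolkit or its user guide: the class and method names (ExecutorSideParsing, parseAmount, totalAmount) and the sample records are hypothetical, and Spark is deliberately left out so the sketch runs on its own. In a real job, a per-record function like parseAmount would typically be passed to a transformation such as map and run inside tasks on executors, while the surrounding orchestration would run in the driver.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical example class; Spark is intentionally not a dependency here.
class ExecutorSideParsing {

    // Per-record logic. In a Spark job this is the kind of function that
    // would execute on executors, once per input record.
    static OptionalDouble parseAmount(String line) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            // A conditional breakpoint with the condition "fields.length < 2"
            // would pause only on records that are missing the amount column.
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(fields[1].trim()));
        } catch (NumberFormatException e) {
            // Only records with a non-numeric amount reach this line, which
            // makes it a useful place for a plain breakpoint.
            return OptionalDouble.empty();
        }
    }

    // Orchestration logic. In a Spark job the equivalent code would run in
    // the driver and delegate the per-record work to executors.
    static double totalAmount(List<String> lines) {
        return lines.stream()
                .map(ExecutorSideParsing::parseAmount)
                .filter(OptionalDouble::isPresent)
                .mapToDouble(OptionalDouble::getAsDouble)
                .sum();
    }

    public static void main(String[] args) {
        // Two well-formed records, one non-numeric amount, one missing column.
        List<String> sample = Arrays.asList("a,10.5", "b,oops", "c,4.5", "d");
        System.out.println(totalAmount(sample)); // prints 15.0
    }
}
```

Stepping through totalAmount corresponds to debugging driver code, while a breakpoint inside parseAmount corresponds to debugging executor code, where malformed records in production data are most likely to surface.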

How to start debugging

The initial configuration to connect to your HDInsight Spark cluster for remote debugging takes only a few clicks in the advanced configuration dialog. You can then set breakpoints in both driver code and executor code, step through the code, and view the execution logs. To learn more, read the user guide Spark Remote Debug through SSH.

Figure: Executor Console

How to install or update

To get the latest version, open the IntelliJ plugin repository and search for “Azure Toolkit.” If you have already installed the plugin, IntelliJ will prompt you when an update is available.

For more information, visit the following resources:

Learn more about today’s announcements on the Azure blog and Big Data blog. Discover more Azure service updates.   

If you have questions, feedback, comments, or bug reports, please use the comments below or send a note to hdivstool@microsoft.com.