Loan ChargeOff Prediction with Azure HDInsight Spark Clusters

A charged-off loan is a loan that a creditor (usually a lending institution) has declared unlikely to be collected, usually because repayment is severely in arrears. Because a high charge-off rate hurts a lending institution’s year-end financials, lending institutions often monitor charge-off risk very closely to prevent loans from being charged off. Using Microsoft R Server on Azure HDInsight, a lending institution can apply machine learning to predict the likelihood of loans being charged off and run a report on the results stored in HDFS and Hive tables.



Estimated provisioning time: 25 minutes

This solution will create an HDInsight Spark cluster with Microsoft R Server. This cluster will contain 2 head nodes, 2 worker nodes and 1 edge node with a total of 32 cores. The approximate cost for this HDInsight Spark cluster is $8.29/hour. Billing starts once a cluster is created and stops when the cluster is deleted. Billing is prorated per minute, so you should always delete your cluster when it is no longer in use. Use the Deployments page to delete the entire solution once you have finished.
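To make the per-minute billing concrete, here is a minimal sketch of the arithmetic. It assumes simple linear proration of the approximate $8.29/hour figure quoted above; the function name and the example durations are illustrative only, and actual charges depend on current Azure pricing and region.

```python
# Approximate hourly rate quoted in the text above; real pricing may differ.
HOURLY_RATE_USD = 8.29


def estimated_cost(minutes_running: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Estimate the charge for a cluster that existed for `minutes_running` minutes.

    Assumes linear per-minute proration of the hourly rate (an assumption,
    not a statement of the exact Azure billing formula).
    """
    if minutes_running < 0:
        raise ValueError("minutes_running must be non-negative")
    return round(hourly_rate * minutes_running / 60, 2)


if __name__ == "__main__":
    for label, minutes in [("1 hour", 60), ("8 hours", 8 * 60), ("24 hours", 24 * 60)]:
        print(f"{label}: ~${estimated_cost(minutes):.2f}")
```

Under these assumptions, a cluster left running for a full day costs roughly 24 times the hourly rate, which is why deleting it promptly matters.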


There are multiple benefits to lending institutions equipping themselves with loan charge-off prediction data. Charging off a loan is a bank’s last resort for a loan that is severely in arrears. With prediction data at hand, a loan officer can offer personalised incentives, such as a lower interest rate or a longer repayment period, to help the customer keep making loan payments and thus prevent the loan from being charged off. To obtain this type of prediction data, credit unions or banks often compile the data by hand from a customer’s past payment history and perform a simple statistical regression analysis. This method is prone to data compilation errors and is not statistically sound.

This solution template offers an end-to-end solution for running predictive analytics on loan data and producing a charge-off probability score. A Power BI report also walks you through the analysis and trends of credit loans and the predicted charge-off probability.

Business Perspective

This loan charge-off prediction uses simulated loan history data to predict the probability of loan charge-off in the immediate future (next three months). The higher the score, the higher the probability of the loan getting charged off in the future.

With the analytics data, the loan manager is also presented with the trends and analytics of charge-off loans by branch locations. Characteristics of high charge-off risk loans will help loan managers to make a business plan for loan offerings in that specific geographical area.
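As a rough illustration of the branch-level view described above, the sketch below computes the share of high-risk loans per branch from a handful of made-up scored loans. The branch names, the probabilities, and the 0.5 threshold are all hypothetical; the solution itself produces this view in Power BI from the Hive table.

```python
from collections import defaultdict

# Hypothetical scored loans: (branch, predicted charge-off probability).
scored_loans = [
    ("Branch A", 0.82), ("Branch A", 0.10),
    ("Branch B", 0.35), ("Branch B", 0.91), ("Branch B", 0.77),
    ("Branch C", 0.05),
]


def high_risk_share_by_branch(loans, threshold=0.5):
    """Return, per branch, the fraction of loans scoring at or above `threshold`."""
    totals = defaultdict(int)
    high_risk = defaultdict(int)
    for branch, probability in loans:
        totals[branch] += 1
        if probability >= threshold:
            high_risk[branch] += 1
    return {branch: high_risk[branch] / totals[branch] for branch in totals}


if __name__ == "__main__":
    for branch, share in sorted(high_risk_share_by_branch(scored_loans).items()):
        print(f"{branch}: {share:.0%} of loans flagged high risk")
```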

Microsoft R Server on HDInsight Spark clusters provides distributed and scalable machine learning capabilities for big data, leveraging the combined power of R Server and Apache Spark. This solution demonstrates how to develop machine learning models for predicting loan charge-off (including data processing, feature engineering, training and evaluating models), deploy the models as a web service (on the edge node) and consume the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters. The final predictions are saved to a Hive table which can be visualised in Power BI.

Power BI also presents visual summaries of the loan payments and charge-off predictions, using simulated data.

Data Scientist Perspective

This solution template walks through the end-to-end process of developing predictive analytics on a set of simulated loan history data to predict loan charge-off risk. The data contains information such as loan holder demographics, loan amount, contractual loan duration and loan payment history. The solution template also includes a set of R scripts that perform data processing and feature engineering, train models with several different algorithms, and finally select the best-performing model to score the data and produce a probability score for each loan. The solution also includes scripts for deploying the model as a web service (on the edge node) and consuming the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters.
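The model-selection step can be sketched in a few lines. The example below is not the solution's R code: it is a self-contained Python illustration that assumes AUC as the comparison metric (the text does not say which metric the scripts use) and uses made-up labels, scores, and model names.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Counts the fraction of (charged-off, not-charged-off) pairs in which the
    charged-off loan received the higher score; ties count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one loan from each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical hold-out data: 1 = loan was charged off, 0 = it was not.
labels = [1, 0, 1, 0, 0, 1]

# Hypothetical charge-off probabilities from two candidate models.
candidate_scores = {
    "model_a": [0.9, 0.2, 0.7, 0.4, 0.1, 0.8],
    "model_b": [0.6, 0.5, 0.4, 0.7, 0.2, 0.9],
}

candidate_aucs = {name: auc(labels, s) for name, s in candidate_scores.items()}
best_model = max(candidate_aucs, key=candidate_aucs.get)

if __name__ == "__main__":
    for name, value in candidate_aucs.items():
        print(f"{name}: AUC = {value:.3f}")
    print(f"Selected for scoring: {best_model}")
```

The selected model's scores would then serve as the per-loan charge-off probabilities. Any other evaluation metric could be substituted for `auc` without changing the selection logic.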

Data scientists who are testing this solution can work with the provided R code from the browser-based Open Source Edition of RStudio Server, which runs on the edge node of the Azure HDInsight Spark cluster. By setting the compute context, the user can decide where the computation will be performed: locally on the edge node, or distributed across the nodes in the Spark cluster. All of the R code can also be found in the public GitHub repository. Have fun!


©2017 Microsoft Corporation. All rights reserved. This information is provided “as is” and may change without notice. Microsoft makes no warranties, express or implied, with respect to the information provided here. Third-party data was used to generate the solution. You are responsible for respecting the rights of others, including procuring and complying with relevant licences in order to create similar datasets.

Related solution architectures

Loan ChargeOff Prediction with SQL Server

This solution demonstrates how to build and deploy a machine learning model with SQL Server 2016 with R Services to predict whether a bank loan will need to be charged off within the next three months.

Loan Credit Risk with SQL Server

By using SQL Server 2016 with R Services, a lending institution can make use of predictive analytics to reduce the number of loans they offer to those borrowers most likely to default, thereby increasing the profitability of their loan portfolio.