Predictive maintenance is one of the most common machine learning use cases and with the latest advancements in information technology, the volume of stored data is growing faster in this domain than ever before which makes it necessary to leverage big data analytic capabilities to efficiently transform large amounts of data into business intelligence. Microsoft has published a series of learning materials including blogs, solution templates, modeling guides and sample tutorials in the domain of predictive maintenance. Recently, we extended those materials by providing a detailed step-by-step tutorial of using Spark Python API PySpark to demonstrate how to approach predictive maintenance for big data scenarios. The tutorial covers typical data science steps such as data ingestion, cleansing, feature engineering and model development.
Business Scenario and Data
The input data is simulated to reflect features that are generic for most of the predictive maintenance scenarios. To enable the tutorial to be completed very quickly, the data was simulated to be around 1.3 GB but the same PySpark framework can be easily applied to a much larger data set. The data is hosted on a publicly accessible Azure Blob Storage container and can be downloaded by clicking this link. In this tutorial, we import the data directly from the blob storage.
The data set has around 2 million records with 172 columns simulated for 1900 machines collected over 4 years. Each machine includes a device which stores data such as warnings, problems and errors generated by the machine over time. Each record has a Device ID and time stamp for each day and aggregated features for that day such as total number of a certain type of warning received in a day. Four categorical columns were also included to demonstrate generic handling of categorical variables. The goal is to predict if a machine will fail in the next 7 days. The last column of the data set indicates if a failure occurred on that day.
We formatted this tutorial as Jupyter notebooks because it is easy to show the step-by-step process this way. You can also easily compile the executable PySpark script(s) using your favorite IDE.
Specifications & Configurations
The hardware used in this tutorial is a Linux Data Science Virtual Machine with 32 cores and 448 GB memory. For more detailed information of the Data Science Virtue Machine, please visit the link. For the size of the data used in this tutorial (1.3 GB), a machine with less cores and memory would also be adequate. However, in real life scenarios, one should choose the hardware configuration that is appropriate for the specific big data use case. Jupyter Notebooks included in this tutorial can also be downloaded and run on any machine that has PySpark enabled.
The Spark version installed on the Linux Data Science Virtual Machine for this tutorial is 2.0.2 with Python version 2.7.5. Please see the tutorial page for some configurations that needs to be performed before running this tutorial on a Linux machine.
- The user should already know some basics of PySpark. This is not meant to be a PySpark 101 tutorial.
- Have PySpark (Spark 2.0., Python 2.7) already configured. Please note if you are using Python 3 on your machine, a few functions in this tutorial require some very minor tweaks because some Python 2 functions deprecated in Python 3.
- Blog post: Predictive Maintenance Modelling Guide in the Cortana Intelligence Gallery
- Predictive Maintenance Modelling Guide
- Predictive Maintenance Modelling Guide R Notebook
- Predictive Maintenance Modelling Guide Python Notebook
- Predictive Maintenance solution
- Predictive Maintenance Template
Special thanks to Said Bleik, Yiyu Chen and Ke Huang for learning PySpark together. Thank Fidan Boylu Uz and Danielle Dean for proof reading and modifying the tutorial materials.