Getting AI/ML and DevOps working better together

Artificial Intelligence (AI) and machine learning (ML) technologies extend the capabilities of software applications that are now found throughout our daily life: digital assistants, facial recognition, photo captioning, banking services, and product recommendations. The difficult part about integrating AI or ML into an application is not the technology, or the math, or the science or the algorithms. The challenge is getting the model deployed into a production environment and keeping it operational and supportable. Software development teams know how to deliver business applications and cloud services. AI/ML teams know how to develop models that can transform a business. But when it comes to putting the two together to implement an application pipeline specific to AI/ML — to automate it and wrap it around good deployment practices — the process needs some effort to be successful.

The need for aligned development approaches

DevOps has become the de facto development standard for cloud services. It emphasizes process and automation, and it fosters a culture that encourages new ways of working together across teams. DevOps is an application-centric paradigm that focuses on the platform, instrumentation, and process to support applications: What infrastructure is needed to support the application? What tools can be used to automate it? What is the release process for QA/production?

AI/ML projects have their own development methodologies, including CRISP-DM and the Microsoft Team Data Science Process (TDSP). Like DevOps, these methodologies are grounded in principles and practices learned from real-world projects. AI/ML teams use an approach unique to data science projects, with frequent, small iterations to refine the data features, the model, and the analytics question. It’s a process intended to align a business problem with AI/ML model development. The release process is not a focus for CRISP-DM or TDSP, and there is little interaction with an operations team. DevOps teams, for their part, are not yet familiar with the tools, languages, and artifacts of data science projects.

DevOps and AI/ML development are two independent methodologies with a common goal: to put an AI application into production. Today it takes effort to bridge the gaps between the two approaches. AI/ML projects need to incorporate some of the operational and deployment practices that make DevOps effective, and DevOps projects need to accommodate the AI/ML development process so that the deployment and release of AI/ML models can be automated.

Integrating AI/ML teams, process, and tools

Based on lessons learned from several Microsoft projects including the Mobile Bank Fraud Solution, some suggestions for bridging the gap between DevOps and AI/ML projects follow.

DevOps for AI/ML

DevOps for AI/ML has the potential to stabilize and streamline the model release process. It is often paired with the practice and toolset to support Continuous Integration/Continuous Deployment (CI/CD). Here are some ways to consider CI/CD for AI/ML workstreams:

  • The AI/ML process relies on experimentation and iteration of models, and it can take hours or days for a model to train and test. Carve out a separate workflow to accommodate the timelines and artifacts for a model build and test cycle. Avoid gating time-sensitive application builds on AI/ML model builds.
  • For AI/ML teams, think about models as having an expectation to deliver value over time rather than a one-time construction of the model. Adopt practices and processes that plan for and allow a model lifecycle and evolution.
  • DevOps is often characterized as bringing together business, development, release, and operational expertise to deliver a solution. Ensure that AI/ML is represented on feature teams and is included throughout the design, development, and operational sessions.

Establish performance metrics and operational telemetry for AI/ML

Use metrics and telemetry to inform which models will be deployed and updated. Metrics can be standard performance measures like precision, recall, or F1 scores. Or they can be scenario-specific measures like the industry-standard fraud metrics developed to inform a fraud manager about a fraud model’s performance. Here are some ways to integrate AI/ML metrics into an application solution:

  • Define model accuracy metrics and track them through model training, validation, testing, and deployment.
  • Define business metrics to capture the business impact of the model in operations. For an example, see the R notebook for fraud metrics.
  • Capture data metrics, like dataset sizes, volumes, update frequencies, distributions, categories, and data types. Model performance can change unexpectedly for many reasons and it’s expedient to know if changes are due to data.
  • Track operational telemetry about the model:  how often is it called? By which applications or gateways? Are there problems? What are the accuracy and usage trends? How much compute or memory does the model consume?
  • Create a model performance dashboard that tracks model versions, performance metrics, and data sets.
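
As a minimal sketch of the first item in the list above, the standard measures can be computed the same way at every stage so the numbers stay comparable. The function names, the stage labels, and the choice of `1` as the positive class are assumptions made for illustration; they are not part of any particular library or of the projects described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationMetrics:
    precision: float
    recall: float
    f1: float


def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when the positive class is never
    # predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return ClassificationMetrics(precision, recall, f1)


def metrics_by_stage(stage_results):
    """Map each stage name (hypothetical labels) to its metrics."""
    return {
        stage: classification_metrics(y_true, y_pred)
        for stage, (y_true, y_pred) in stage_results.items()
    }
```

Calling `metrics_by_stage({"validation": (labels, predictions), "testing": (labels2, predictions2)})` returns one record per stage, which could then be written to whatever store feeds the dashboard.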

AI/ML models need to be updated periodically. Over time, and as new and different data becomes available — or customers or seasons or trends change — a model will need to be re-trained to continue to be effective. Use metrics and telemetry to help refine the update strategy and determine when a model needs to be re-trained.
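
One possible way to turn tracked metrics into a retraining signal is sketched below. The rule itself (compare the mean of the most recent F1 scores against a baseline minus a tolerance) and the default values for `tolerance` and `window` are assumptions chosen for illustration; a real update strategy would be tuned to the model and the business scenario.

```python
def needs_retraining(f1_history, baseline_f1, tolerance=0.05, window=3):
    """Return True when recent performance has drifted below the baseline.

    f1_history  -- F1 scores in chronological order (oldest first)
    baseline_f1 -- the score the model achieved when it was released
    tolerance   -- how far below the baseline the recent mean may fall
    window      -- how many of the most recent scores to average
    """
    if len(f1_history) < window:
        return False  # not enough observations to decide yet
    recent = f1_history[-window:]
    return sum(recent) / window < baseline_f1 - tolerance
```

Averaging over a window instead of reacting to a single score is a deliberate choice here: it avoids triggering a costly retraining cycle on one noisy measurement.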

Automate the end-to-end data and model pipeline

The AI/ML pipeline is an important concept because it connects the necessary tools, processes, and data elements to produce and operationalize an AI/ML model. It also introduces another dimension of complexity for a DevOps process. One of the foundational pillars of DevOps is automation, but automating an end-to-end data and model pipeline is a byzantine integration challenge.

Workstreams in an AI/ML pipeline are typically divided between different teams of experts, and each step in the process can be very detailed and intricate. It may not be practical to automate across the entire pipeline because of the differences in requirements, tools, and languages. Identify the steps in the process that can be easily automated, like the data transformation scripts or the data and model quality checks. Consider the following workstreams:

Workstream | Description | Automation
---------- | ----------- | ----------
Data Analysis | Includes data acquisition and focusing on exploring, profiling, cleaning, and transforming. Also includes enriching and staging data for modeling. | Develop scripts and tests to move and validate the data. Also create scripts to report on the data quality, changes, volume, and consistency.
Experimentation | Includes feature engineering, model fitting, and model evaluation. | Develop scripts, tests, and documentation to reproduce the steps and capture model outputs and performance.
Release Process | Includes the process for deploying a model and data pipeline into production. | Integrate the AI/ML pipeline into the release process.
Operationalization | Includes capturing operational and performance metrics. | Create operational instrumentation for the AI/ML pipeline. For subsequent model retraining cycles, capture and store model inputs and outputs.
Model Re-training and Refinement | Determine a cadence for model re-training. | Instrument the AI/ML pipeline with alerts and notifications to trigger retraining.
Visualization | Develop an AI/ML dashboard to centralize information and metrics related to the model and data. Include accuracy, operational characteristics, business impact, history, and versions. | n/a
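
The data validation scripts mentioned for the Data Analysis workstream could start as small as the sketch below. It assumes the data arrives as a list of dictionaries, one per row; the function name, the `max_null_fraction` threshold, and the message format are all hypothetical.

```python
def check_data_quality(rows, required_columns, max_null_fraction=0.1):
    """Return a list of human-readable problems found in `rows`.

    rows             -- list of dicts, one per record
    required_columns -- column names that every record should carry
    max_null_fraction -- largest share of missing values tolerated per column
    """
    if not rows:
        return ["dataset is empty"]
    problems = []
    for column in required_columns:
        # A value counts as missing if the key is absent or set to None.
        missing = sum(1 for row in rows if row.get(column) is None)
        fraction = missing / len(rows)
        if fraction > max_null_fraction:
            problems.append(f"{column}: {fraction:.0%} of values are missing")
    return problems
```

An empty result means the checks passed. A pipeline step could fail the run, or only raise a warning, when the list is non-empty.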

An automated end-to-end process for the AI/ML pipeline can accelerate development and drive reproducibility, consistency, and efficiency across AI/ML projects.
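
For the Operationalization workstream, a thin wrapper around the model's prediction function is one way to begin collecting telemetry. This is a sketch only: the `ModelTelemetry` class and the `caller` label are invented for illustration, and a production system would more likely forward these counters to an existing monitoring service than keep them in memory.

```python
import time
from collections import defaultdict


class ModelTelemetry:
    """Wraps a prediction function and records calls, errors, and latency."""

    def __init__(self, predict_fn, clock=time.perf_counter):
        self._predict_fn = predict_fn
        self._clock = clock
        self.calls = defaultdict(int)    # caller name -> number of calls
        self.errors = defaultdict(int)   # caller name -> number of failures
        self.total_seconds = 0.0         # cumulative time spent predicting

    def predict(self, features, caller="unknown"):
        start = self._clock()
        self.calls[caller] += 1
        try:
            return self._predict_fn(features)
        except Exception:
            self.errors[caller] += 1
            raise  # telemetry must not hide the failure from the caller
        finally:
            self.total_seconds += self._clock() - start
```

Recording the caller answers the "by which applications or gateways?" question, and the error counter gives a first indication of whether there are problems.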

Versioning 

Versioning is about keeping track of an application’s artifacts and the changes to the artifacts.

In software development projects, this includes code, scripts, documentation, and files. A similar practice is just as important for AI/ML projects because there are typically multiple components, each with separate release and versioning cycles. In AI/ML projects, the artifacts could include:

  • Data: training data, inference data, data metrics, graphs, plots, data structures, schemas
  • Models: trained models, scoring models, A/B testing models
  • Model outputs: predictions, model metrics, business metrics 
  • Algorithms, code, notebooks

Versioning can help provide:

  • Traceability for model changes from multiple collaborators
  • Audit trails for project artifacts
  • Information about which models are called from which applications

A practical example of the importance of versioning for the AI/ML team happens when the performance of a model changes unexpectedly, and the change has nothing to do with the model itself. The ability to easily trace back inputs, dependencies, model, and data set versions could save days or weeks of effort.
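
To make that kind of trace-back concrete, the sketch below records a small manifest alongside each trained model and compares two manifests to show what changed. The manifest fields and function names are assumptions for illustration; dedicated experiment-tracking tools cover the same ground in far more depth.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Short SHA-256 fingerprint that changes whenever the bytes change."""
    return hashlib.sha256(data).hexdigest()[:12]


def build_manifest(model_version: str, training_data: bytes, dependencies: dict) -> dict:
    """Describe what went into a model: its version, data, and dependencies."""
    return {
        "model_version": model_version,
        "training_data_sha256": fingerprint(training_data),
        "dependencies": dict(sorted(dependencies.items())),
    }


def diff_manifests(old: dict, new: dict) -> list:
    """Return the names of the manifest fields whose values differ."""
    return sorted(key for key in old.keys() | new.keys() if old.get(key) != new.get(key))
```

If two runs share a model version but `diff_manifests` reports `training_data_sha256`, the data changed underneath the model, which is exactly the situation described above.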

At a minimum, decide on a consistent naming convention and use it for the data files, folders, and AI/ML models. Several different teams will be involved in the modeling process and without naming conventions, there will be confusion over which data sets or model versions to use.
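
A naming convention is easier to keep consistent when a single helper both builds and validates the names. The pattern used here (`<project>_<kind>_v<version>_<date>`) is only one hypothetical convention, picked for illustration; what matters is that every team uses the same one.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<data|model>_v<version>_<YYYYMMDD>
_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<kind>data|model)_v(?P<version>\d+)_(?P<day>\d{8})$"
)


def artifact_name(project: str, kind: str, version: int, day: date) -> str:
    """Build a name that follows the convention, or raise ValueError."""
    name = f"{project.lower()}_{kind}_v{version:03d}_{day:%Y%m%d}"
    if not _PATTERN.match(name):
        raise ValueError(f"cannot build a valid artifact name from: {name!r}")
    return name


def parse_artifact_name(name: str) -> dict:
    """Split a conforming name into its parts, or raise ValueError."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return match.groupdict()
```

For example, `artifact_name("fraud", "model", 7, date(2024, 1, 15))` returns `"fraud_model_v007_20240115"`, and `parse_artifact_name` recovers the project, kind, version, and date from that string.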

Consider container architectures

Container architectures have the potential to streamline and simplify model development, test, and deployment. And as a package-based interface, containers make it easy for software applications to connect. Containers create an abstraction layer between models and the underlying infrastructure. This lets the AI/ML team focus on model development and not worry about the platform. Containers can easily enable:

  • A/B testing 
  • Deployment to multiple environments (IoT edge, local desktop, Azure infrastructure)
  • Consistent environment configuration and setup to drive faster model development, test, and release cycles
  • Model portability and scalability
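
When two model versions run side by side in separate containers, A/B testing needs a rule for deciding which one serves a given request. The sketch below shows one common approach, deterministic assignment by hashing a request or user identifier. The function name, the variant labels, and the `share_b` parameter are assumptions; in practice this logic often lives in a gateway or load balancer and not in application code.

```python
import hashlib


def choose_variant(request_id: str, share_b: float = 0.1) -> str:
    """Deterministically assign a request to model variant "A" or "B".

    request_id -- any stable identifier (for example, a user or session ID)
    share_b    -- approximate fraction of traffic sent to variant "B"
    """
    if not 0.0 <= share_b <= 1.0:
        raise ValueError("share_b must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Reduce the hash to one of 10,000 buckets; the same ID always
    # lands in the same bucket, so assignment is stable across calls.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return "B" if bucket < share_b * 10_000 else "A"
```

Hashing is used in place of random sampling so that the same user keeps seeing the same model version, which keeps the comparison between the two variants clean.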

Recommended next steps

The adoption of DevOps has been very effective at bringing together software development and operations teams to simplify and improve deployment and release processes. As AI and ML become increasingly important components of applications, there will be more pressure to ensure they are part of an organization’s DevOps model. The suggestions presented here are examples of steps toward integrating the two methodologies. To get started, please use some of the links below and please share your feedback and your experience!