MLOps Blog Series Part 2: Testing robustness of secure machine learning systems using machine learning ops

Robustness is the ability of a closed-loop system to tolerate perturbations or anomalies while system parameters are varied over a wide range. Three types of testing are essential to making a machine learning system robust in production environments: unit testing, data and model testing, and integration testing.

Unit testing

Unit tests are performed on individual components that each have a single function within the bigger system (for example, a function that creates a new feature as a column in a DataFrame, or a function that adds two numbers). A recommended method for writing unit tests is the Arrange, Act, Assert (AAA) approach:

1.    Arrange: Set up the schema, create object instances, and create test data/inputs.
2.    Act: Execute code, call methods, set properties, and apply inputs to the components to test.
3.    Assert: Check the results, validate that the outputs received are as expected, and clean up any test-related remains.
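
As a minimal sketch of the AAA approach, the test below checks a hypothetical feature-creation function, `add_ratio_feature`, which is an illustration and not part of any library. It uses plain Python dictionaries in place of a DataFrame so that it runs with the standard library alone.

```python
def add_ratio_feature(rows):
    """Hypothetical feature function: adds 'price_per_sqft' to a copy of each row."""
    return [
        {**row, "price_per_sqft": row["price"] / row["sqft"]}
        for row in rows
    ]


def test_add_ratio_feature():
    # Arrange: create the test inputs.
    rows = [{"price": 200_000.0, "sqft": 1_000.0}]

    # Act: call the component under test.
    result = add_ratio_feature(rows)

    # Assert: confirm the output is as expected.
    assert result[0]["price_per_sqft"] == 200.0
    # The function returns new rows, so the input should be unchanged.
    assert "price_per_sqft" not in rows[0]
```

A test written this way can be collected by a runner such as pytest or called directly as `test_add_ratio_feature()`. Because the function builds new rows, there is nothing to clean up here; tests that create files or connections would release them at the end of the Assert step.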

Data and model testing

It is important to test the integrity of the data and models in operation. Tests can be performed in the MLOps pipeline to validate the integrity of data and the model robustness for training and inference. The following are some general tests that can be performed to validate the integrity of data and the robustness of the models:

1.    Data testing: The integrity of the data under test can be checked by inspecting the following five factors: accuracy, completeness, consistency, relevance, and timeliness. Some important aspects to consider when ingesting or exporting data for model training and inference include the following:

•    Rows and columns: Check rows and columns to ensure no missing values or incorrect patterns are found.

•    Individual values: Check that individual values fall within the expected range and are not missing, to ensure the correctness of the data.

•    Aggregated values: Check statistical aggregations for columns or groups within the data to understand the correspondence, coherence, and accuracy of the data.

2.   Model testing: The model should be tested both during training and after it has been trained to ensure that it is robust, scalable, and secure. The following are some aspects of model testing:

•    Check the shape of the model input (for the serialized or non-serialized model).

•    Check the shape and values of the model output.

•    Behavioral testing (combinations of inputs and expected outputs).

•    Load serialized or packaged model artifacts into memory and onto deployment targets. This ensures that the model is deserialized properly and is ready to be served in memory and on the deployment targets.

•    Evaluate the accuracy or key metrics of the ML model.
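
Several of the data and model checks above can be sketched with the standard library alone. Everything named here is an assumption made for illustration: `check_rows`, `check_mean`, and `ToyLinearModel` are hypothetical, and JSON stands in for whatever serialization format a real framework would use.

```python
import json
import statistics


def check_rows(rows, required_columns, value_ranges):
    """Return a list of data problems; an empty list means the checks passed."""
    problems = []
    for i, row in enumerate(rows):
        # Rows and columns: every required column is present and not missing.
        for col in required_columns:
            if row.get(col) is None:
                problems.append(f"row {i}: missing value in '{col}'")
        # Individual values: each value falls within its allowed range.
        for col, (low, high) in value_ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: '{col}'={value} outside [{low}, {high}]")
    return problems


def check_mean(rows, column, expected, tolerance):
    """Aggregated values: the column mean stays close to a reference value."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return abs(statistics.mean(values) - expected) <= tolerance


class ToyLinearModel:
    """Hypothetical stand-in for a trained model: y = w . x + b."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = bias

    def predict(self, batch):
        # Input shape check: each example needs one value per weight.
        for x in batch:
            if len(x) != len(self.weights):
                raise ValueError(f"expected {len(self.weights)} features, got {len(x)}")
        return [sum(w * v for w, v in zip(self.weights, x)) + self.bias for x in batch]

    def to_json(self):
        return json.dumps({"weights": self.weights, "bias": self.bias})

    @classmethod
    def from_json(cls, payload):
        data = json.loads(payload)
        return cls(data["weights"], data["bias"])


def test_model_round_trip_and_behavior():
    model = ToyLinearModel(weights=[2.0, -1.0], bias=0.5)
    # Loading the serialized artifact should give an equivalent model.
    restored = ToyLinearModel.from_json(model.to_json())
    batch = [[1.0, 1.0], [0.0, 2.0]]
    outputs = restored.predict(batch)
    # Output shape: one prediction per input example.
    assert len(outputs) == len(batch)
    # Behavioral test: known inputs map to the expected outputs.
    assert outputs == [1.5, -1.5]
    # Serialization must not change the predictions.
    assert outputs == model.predict(batch)
```

In a pipeline, a non-empty result from `check_rows` or a `False` from `check_mean` would typically fail the run before training or inference starts. Metric evaluation is left out of this sketch because it depends on the use case.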

Integration testing

Integration testing is a process in which individual software components are combined and tested as a group (for example, data processing, inference, or CI/CD).

Figure 1: Integration testing (two modules), shown as two circles overlapping in the center.

Let’s look at a simple hypothetical example of performing integration testing for two components of the MLOps workflow. In the Build module, the data ingestion and model training steps have individual functionalities, but when integrated, they perform ML model training using the data ingested into the training step. By integrating module 1 (data ingestion) and module 2 (model training), we can perform data loading tests (to see whether the ingested data reaches the model training step), input and output tests (to confirm that each step receives and produces the expected formats), as well as any other tests that are use case-specific.
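
The example above could be sketched as follows. The `ingest` and `train` functions are hypothetical stand-ins for the two modules, the CSV column names `x` and `y` are assumptions, and the "model" is just a one-parameter least-squares fit so that the sketch stays self-contained.

```python
import csv
import io


def ingest(csv_text):
    """Module 1 (data ingestion): parse CSV text into feature rows and labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        features.append([float(row["x"])])
        labels.append(float(row["y"]))
    return features, labels


def train(features, labels):
    """Module 2 (model training): fit y = slope * x by least squares through the origin."""
    if not features or len(features) != len(labels):
        raise ValueError("features and labels must be non-empty and the same length")
    numerator = sum(x[0] * y for x, y in zip(features, labels))
    denominator = sum(x[0] ** 2 for x in features)
    return {"slope": numerator / denominator}


def test_ingestion_feeds_training():
    csv_text = "x,y\n1,2\n2,4\n3,6\n"
    # Data loading test: the ingested data reaches the training step.
    features, labels = ingest(csv_text)
    assert len(features) == len(labels) == 3
    # Input/output test: each step emits the format the next one expects.
    assert all(isinstance(x, list) and len(x) == 1 for x in features)
    model = train(features, labels)
    assert set(model) == {"slope"}
    assert abs(model["slope"] - 2.0) < 1e-9
```

Unlike a unit test of either function alone, this test fails if the two modules disagree about the data format passed between them.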

In general, integration testing can be done in two ways:

1.    Big Bang testing: An approach in which all the components or modules are integrated simultaneously and then tested as a unit.

2.    Incremental testing: Testing is carried out by merging two or more modules that are logically connected to one another and then testing the application’s functionality. Incremental tests are conducted in three ways:

•    Top-down approach

•    Bottom-up approach

•    Sandwich approach: a combination of top-down and bottom-up

Figure 2: Integration testing (incremental testing), in which modules are tested using a bottom-up or top-down approach.

The top-down testing approach is a way of doing integration testing from the top to the bottom of the control flow of a software system. Higher-level modules are tested first, and then lower-level modules are evaluated and merged to ensure software operation. Stubs are used to test modules that aren’t yet ready. The advantages of a top-down strategy include the ability to get an early prototype, test essential modules on a high-priority basis, and uncover and correct serious defects sooner. One downside is that it necessitates a large number of stubs, and lower-level components may be insufficiently tested in some cases.
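
A minimal sketch of a top-down test with stubs is shown below. The higher-level `run_pipeline` function and its arguments are hypothetical; the stubs are built with `Mock` from Python's standard `unittest.mock` module and stand in for lower-level modules that are assumed not to be ready yet.

```python
from unittest.mock import Mock


def run_pipeline(ingest, train, source):
    """Higher-level module: orchestrates the lower-level steps."""
    data = ingest(source)
    if not data:
        raise ValueError("no data ingested")
    model = train(data)
    return {"rows_used": len(data), "model": model}


def test_pipeline_with_stubs():
    # Stubs return canned values in place of the real lower-level modules.
    ingest_stub = Mock(return_value=[(1.0, 2.0), (2.0, 4.0)])
    train_stub = Mock(return_value="stub-model")

    result = run_pipeline(ingest_stub, train_stub, source="data.csv")

    # The higher-level control flow is verified without real implementations.
    ingest_stub.assert_called_once_with("data.csv")
    train_stub.assert_called_once_with([(1.0, 2.0), (2.0, 4.0)])
    assert result == {"rows_used": 2, "model": "stub-model"}
```

As the real ingestion and training modules become available, each stub is replaced by its implementation and the same test is run again. Every lower-level module needs its own stub in the meantime, which is why the number of stubs grows quickly with this approach.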

The bottom-up testing approach tests the lower-level modules first. The modules that have been tested are then used to assist in the testing of higher-level modules. This procedure is continued until all top-level modules have been thoroughly evaluated. When the lower-level modules have been tested and integrated, the next level of modules is created. With the bottom-up technique, you don’t have to wait for all the modules to be built. One downside is that the essential modules (at the top level of the software architecture) that impact the program’s flow are tested last and are thus more likely to have defects.

The sandwich testing approach tests top-level modules alongside lower-level modules, while lower-level components are merged with top-level modules and evaluated as a system. It is also termed hybrid integration testing because it combines top-down and bottom-up methodologies.

Learn more

For further details and to learn about hands-on implementation, check out the Engineering MLOps book, or learn how to build and deploy a model in Azure Machine Learning using MLOps in the “Get Time to Value with MLOps Best Practices” on-demand webinar. Also, check out our recently announced blog about solution accelerators (MLOps v2) to simplify your MLOps workstream in Azure Machine Learning.