
Introduction to Metamorphic Testing on Classifiers

A considerable amount of the research on metamorphic testing of classifiers has been driven by Columbia University. This blog provides an overview of the work done so far on the metamorphic testing approach for classifier algorithms in data mining/machine learning. The majority of the research effort in machine learning focuses on building more accurate models that can better achieve the goal of automated learning from the real world. However, to date very little work has been done on assuring the correctness of the software applications that perform machine learning.

Formal proofs of an algorithm's optimal quality do not guarantee that an application implements or uses the algorithm correctly, and thus software testing is necessary.

Metamorphic testing is a testing technique that exploits properties of functions for which particular changes to the input lead to predictable changes in the output, based on so-called "metamorphic relations" between given sets of inputs and their corresponding outputs. Xie et al. note that although the correct output cannot be known in advance, if the change is not as expected, then a defect must exist.

Supervised ML applications consist of two phases. The first phase (called the training phase) analyzes the training data; the result of this analysis is a model that attempts to make generalizations about how the attributes relate to the label. In the second phase (called the testing phase), the model is applied to another, previously-unseen data set (the testing data) where the labels are unknown. In a classification algorithm, the system attempts to predict the label of each individual record in the testing data.
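For concreteness, here is a minimal sketch of the two phases using scikit-learn; the choice of the iris dataset and a decision tree is ours, purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training phase: build a model that generalizes from attributes to labels.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Testing phase: apply the model to previously-unseen data
# and predict the label of each record.
predicted_labels = model.predict(X_test)
```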

The first step is to identify a set of properties ("metamorphic relations", or MRs) that relate multiple pairs of inputs and outputs of the target program. Then, pairs of source test cases and their corresponding follow-up test cases are constructed based on these MRs. We then execute all these test cases using the target program, and check whether the outputs of the source and follow-up test cases satisfy their corresponding MRs. For example, if a program sums a set of numbers and we multiply every input by 2, the resulting sum should be twice the original, since the inputs were simply scaled.
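A toy illustration of this workflow, using the sum example above as the target program:

```python
def program_under_test(xs):
    # The target program: here, simply the sum of its inputs.
    return sum(xs)

# Source test case.
source_input = [1.0, 5.0, 7.5]
source_output = program_under_test(source_input)

# Follow-up test case: the MR says that scaling every input by 2
# should scale the output by 2.
followup_input = [2 * x for x in source_input]
followup_output = program_under_test(followup_input)

# If the relation is violated, a defect must exist.
assert followup_output == 2 * source_output
```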

Some established MRs in classification algorithms are listed below; a sketch of checking one of them appears after the list. For more information about the MRs, please see my references.

MRs in classification algorithms:

  1. MR-0: Consistence with affine transformation.
  2. MR-1.1: Permutation of class labels.
  3. MR-1.2: Permutation of the attribute.
  4. MR-2.1: Addition of uninformative attributes.
  5. MR-2.2: Addition of informative attributes.
  6. MR-3.1: Consistence with re-prediction.
  7. MR-3.2: Additional training sample.
  8. MR-4.1: Addition of classes by duplicating samples.
  9. MR-4.2: Addition of classes by re-labeling samples.
  10. MR-5.1: Removal of classes.
  11. MR-5.2: Removal of samples.
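As an illustration, here is a sketch of checking MR-1.2 (permutation of the attribute) against a k-nearest-neighbors classifier. The choice of scikit-learn, kNN, and the iris dataset is ours; the precise formulation of the MR is in the cited papers. Because kNN's distance computation does not depend on the order of the attributes, permuting the attribute columns consistently in the training and testing data should leave the predictions unchanged:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
perm = rng.permutation(X.shape[1])  # a random reordering of the attributes

# Source execution: train and predict on the original attribute order.
source_pred = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)

# Follow-up execution: apply the same permutation to the attributes
# of both the training and the testing data.
followup_pred = (KNeighborsClassifier()
                 .fit(X_train[:, perm], y_train)
                 .predict(X_test[:, perm]))

# MR-1.2: the predictions should be identical; a difference reveals a defect.
assert (source_pred == followup_pred).all()
```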

It is not hard to see, then, that metamorphic testing is simple to implement, effective, easily automated, and independent of any particular programming language.

An Introduction to Automated Metamorphic System Testing (as described by Murphy et al.)

The idea is to automate testing at the system level by treating the application as a black box and checking that the metamorphic properties of the entire application hold after its execution. This does not require the tester to have access to the source code, but only to know the system's metamorphic properties.

The testing approach used by Murphy et al. is supported by an implementation framework called Amsterdam. This testing framework automates the process by which the program's input data are modified, multiple executions of the application with its different inputs are run in parallel, and the outputs of the executions are compared to check that the metamorphic properties are satisfied.
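Amsterdam's actual interface is not reproduced here; the following is a hand-rolled sketch of the kind of black-box workflow such a framework automates. The command-line classifier ./classify, its CSV input format, and the column-reversal transformation are all hypothetical:

```python
import subprocess

def run(input_path):
    # Run the application as a black box; no access to its source code.
    result = subprocess.run(["./classify", input_path],
                            capture_output=True, text=True, check=True)
    return result.stdout

def permute_columns(src_path, dst_path):
    # Transform the input data (here: reverse the column order of a CSV).
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(",".join(reversed(line.strip().split(","))) + "\n")

permute_columns("input.csv", "input_permuted.csv")
source_output = run("input.csv")
followup_output = run("input_permuted.csv")

# For a permutation MR, the expected relation is output equality.
assert source_output == followup_output, "metamorphic property violated"
```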


Such automation matters. The manual transformation of the input data, for instance, can be laborious and error-prone, especially when the input consists of large tables of data rather than just scalars or small sets. On a similar note, input data that is not human-readable (for instance, binary files representing network traffic) cannot easily be modified by hand, and thus a tool is necessary.

One-off scripts could be created, but to date there is no general framework that addresses different types of transformations and different types of input formats for purposes of metamorphic testing.

Additionally, manual comparison of the program outputs can cause problems. Like transforming the input data, comparing outputs by hand is error-prone and tedious. Tools like "diff" are sometimes useful, but not when there are a large number of expected changes between the outputs. An automation framework like Amsterdam solves this problem as well.
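As a sketch of what an MR-aware comparison (as opposed to a raw diff) might look like, consider MR-1.1: if the class labels are permuted in the training data, each follow-up prediction should be the correspondingly permuted source prediction. The function and data below are hypothetical illustrations:

```python
def satisfies_label_permutation_mr(source_preds, followup_preds, label_map):
    # MR-1.1: each follow-up prediction should equal the source
    # prediction with the label permutation applied.
    return all(label_map[s] == f
               for s, f in zip(source_preds, followup_preds))

# Hypothetical example: labels 0 and 1 were swapped in the follow-up run.
print(satisfies_label_permutation_mr([0, 1, 1, 0],
                                     [1, 0, 0, 1],
                                     label_map={0: 1, 1: 0}))  # True
```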