
In critical real-world applications of machine learning, it is important to understand when a model’s predictions must not be trusted. On data drawn from the same distribution as the training set, machine learning models can often achieve super-human predictive performance. However, predictions on out-of-distribution (OOD) inputs should be ignored and OOD samples, such as data from novel, unseen classes, should be flagged.

Consider, for instance, a hospital with a severe shortage of qualified personnel. To make up for the lack of doctors, the hospital would like to use an automated system for real-time diagnosis from X-ray images (Task I) and a novelty detection system, run at the end of each week, to detect outbreaks of novel disease variants (Task II), as illustrated in Figure 1. In particular, the detection algorithm can be fine-tuned weekly on the unlabeled batch of data collected during that week.

Figure 1. Example of a semi-supervised novelty detection setting.

The novelty detection model helps to flag instances of new conditions and can request human review for these patients. The human experts can then label these images and include them in the labeled training set to update both the diagnostic prediction and the novelty detection systems. This process repeats each week and enables both diagnostic and novelty detection models to adjust to new emerging diseases.

Semi-supervised novelty detection

Scenarios like the one in Figure 1 fall under the semi-supervised novelty detection (SSND) setting. Concretely, SSND approaches can be trained on both a labeled in-distribution (ID) training set (e.g. the dataset used for training the classifier for Task I) and an unlabeled set containing an unknown mixture of ID and OOD samples.

Of course, in an SSND scenario one can still use an unsupervised novelty detection approach. However, this can be problematic for a number of reasons:

  • Unsupervised novelty detection (UND) methods use only ID data for training. However, it is difficult to detect all OOD data well when so little information is available. Indeed, UND methods perform poorly, especially on near-OOD data, where outliers are very similar to ID samples.
  • Augmented unsupervised novelty detection (A-UND) approaches use known OOD data as a surrogate for unknown novelties. For these methods to work well, the surrogate and test OOD data must be similar. But novel data is, by definition, unpredictable, and hence it is often not possible to choose an appropriate set of known outliers for training these methods.

SSND methods have the advantage that they can adapt to newly emerging OOD data without explicitly knowing which samples in the unlabeled set are OOD. Thus, in principle, they can solve the issues faced by UND and A-UND methods. However, prior SSND algorithms either don’t work with complex models like neural networks (e.g. Blanchard et al., Munoz-Mari et al.) or perform poorly on near-OOD data (e.g. Yu et al., Kiryo et al.).

Ensembles with regularized disagreement (ERD)

We set out to develop a semi-supervised novelty detection methodology that can be applied to large-scale problems that use deep neural networks. The approach that we propose can be described simply as follows:

Train an ensemble of classifiers such that the models' predictions disagree only on OOD data.

For good novelty detection, we need ensembles with the right amount of diversity. The models’ predictions should disagree on the unlabeled OOD samples, but agree on the unlabeled ID points (see Figure 2c). To achieve this, we need two important ingredients:

1) Training procedure: We train each model \(\hat{f}_c\) in the ensemble on a training set that consists of the labeled ID dataset \(S\) and the set \((U, c)\). The latter contains the samples in the unlabeled set \(U\), all labeled with the same class \(c\) selected at random from the set of ID labels. Since we train each model \(\hat{f}_c\) with a different label \(c\) assigned to \(U\), we avoid ensembles with too little disagreement, like in Figure 2a.
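The data construction in step 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_erd_training_sets` and its arguments are hypothetical, and the sketch assumes that the labels are drawn without replacement so that no two models share the same \(c\).

```python
import numpy as np


def build_erd_training_sets(X_S, y_S, X_U, num_models, num_classes, seed=0):
    """Build one training set per ensemble member: S together with (U, c).

    X_S, y_S : labeled ID samples and their integer labels.
    X_U      : unlabeled samples (unknown mixture of ID and OOD).
    Returns a list of (c, X, y) tuples, one per model.
    """
    if num_models > num_classes:
        # Assumption of this sketch: every model gets a distinct label c.
        raise ValueError("need at least as many ID classes as models")

    rng = np.random.default_rng(seed)
    # Draw a different ID label c for each model, at random.
    labels = rng.choice(num_classes, size=num_models, replace=False)

    training_sets = []
    for c in labels:
        # All points in U receive the same label c for this model.
        y_U = np.full(len(X_U), c, dtype=y_S.dtype)
        X = np.concatenate([X_S, X_U], axis=0)
        y = np.concatenate([y_S, y_U], axis=0)
        training_sets.append((int(c), X, y))
    return training_sets
```

Each returned set would then be used to train (or fine-tune) one classifier \(\hat{f}_c\) with an ordinary supervised loss.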

2) Regularized disagreement: The unlabeled set \(U\) contains both ID points (i.e. \(U_{ID}\)) and novel-class points (i.e. \(U_{OOD}\)). Assigning the label \(c\) to \(U_{ID}\) effectively introduces label noise in the training data. We denote by \(U_{ID}^c\) the points in \(U_{ID}\) whose true label is \(c\) and \(U_{ID}^{\neg c}\) the complement of this set. We want to limit ensemble disagreement to only the OOD samples in \(U\). Since noisy data is learned late during training, well after the clean data (see e.g. Li et al), we can use early stopping to get a model that does not fit the label \(c\) on \(U_{ID}\). Thus, we avoid ensembles with too much disagreement, like in Figure 2b.

Figure 2. We want ensembles with the right amount of disagreement, limited to only OOD samples.

Experimental results

We evaluate this method on both standard image datasets (e.g. SVHN, CIFAR10) and a medical image benchmark. Even in difficult near-OOD settings, ERD outperforms state-of-the-art baselines.

Figure 3. ERD outperforms SOTA baselines on difficult near-OOD datasets.

These results show that it is indeed possible to use complex neural networks to detect samples from novel classes with great accuracy. Nevertheless, there is still room for improvement and for finding even more effective ways to obtain ensembles with regularized disagreement.


How do we detect a novel-class sample?

Given an ensemble, we propose to flag samples as OOD using the following metric: the average pairwise disagreement between the predictions of the models in the ensemble. In the paper, we discuss why this metric is better than the more commonly used method of averaging the predictions of the models.
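As a rough sketch of such a score, the snippet below averages a distance between the predicted class probabilities of every pair of models. The choice of total variation distance as the disagreement measure is an assumption made for illustration, and the function name `avg_pairwise_disagreement` is hypothetical; the paper should be consulted for the exact definition.

```python
from itertools import combinations

import numpy as np


def avg_pairwise_disagreement(probs):
    """Average pairwise disagreement of an ensemble, per sample.

    probs : array of shape (K, N, C) holding the predicted class
            probabilities of K models on N samples with C classes.
    Returns an array of shape (N,); larger values mean more disagreement.
    """
    probs = np.asarray(probs, dtype=float)
    num_models = probs.shape[0]
    if num_models < 2:
        raise ValueError("need at least two models to measure disagreement")

    pairs = list(combinations(range(num_models), 2))
    scores = np.zeros(probs.shape[1])
    for i, j in pairs:
        # Total variation distance between the two predictive
        # distributions (assumed disagreement measure in this sketch).
        scores += 0.5 * np.abs(probs[i] - probs[j]).sum(axis=-1)
    return scores / len(pairs)
```

A sample would then be flagged as OOD when its score exceeds a chosen threshold; how to pick that threshold is not covered by this sketch.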

Why early stopping? How about other regularization methods?

In principle, other regularization techniques could work too. However, early stopping has the advantage that we only need to train one model to select the optimal iteration using an ID validation set. Other regularization techniques require expensive grid searches for hyperparameter selection.
Bonus: In the paper we provide theoretical guarantees for early stopping, for a Gaussian mixture model and a two-layer neural network. Our result shows that there exists an optimal stopping time at which model \(\hat{f}_c\) predicts the label \(c\) on \(U_{OOD}\) and the correct label on \(U_{ID}\).
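To make the selection step concrete, here is one simple way the stopping iteration could be chosen from ID validation accuracies recorded during a single training run. The rule used here (take the earliest iteration with the highest ID validation accuracy) is an assumption of this sketch, and `select_stopping_iteration` is a hypothetical helper, not a function from the released code.

```python
def select_stopping_iteration(val_accuracies):
    """Return the index of the earliest iteration with the best ID validation accuracy.

    val_accuracies : sequence of accuracies on a held-out ID validation set,
                     one entry per checkpointed iteration (or epoch).
    """
    if len(val_accuracies) == 0:
        raise ValueError("no validation accuracies recorded")
    # max() returns the first maximal element, so ties go to the earlier
    # iteration, i.e. the one that has had less time to fit the noisy labels.
    return max(range(len(val_accuracies)), key=lambda t: val_accuracies[t])
```

The intuition is that once a model starts fitting the label \(c\) on \(U_{ID}\), its accuracy on held-out ID data should drop, so the peak of the validation curve is a natural stopping point.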

How large does the unlabeled set need to be for good performance?

Our method works well for a wide range of unlabeled set sizes, as we discuss in the paper. In practice, if the unlabeled set is too small, one can simply revert to using unsupervised novelty detection techniques.

Can the method be used for transductive novelty detection, i.e. evaluated on the same unlabeled set used for fine-tuning?

Yes! Check out the paper for the experimental results.

What is the computational cost of training ERD?

We only need to fine-tune two-model ensembles to get good performance with ERD, as we show in the paper. For instance, in applications like the one in Figure 1, ERD fine-tuning introduces little overhead and works well even with scarce resources (e.g. it takes around 5 minutes on 2 GPUs for the settings in Figure 3).

How does ERD compare to pretrained models like ViT?

OOD detection with pretrained ViT models performs very well when the ID and OOD data are similar to the data used for pretraining (e.g. CIFAR10 classes as both ID and OOD, with ImageNet used for pretraining). However, our experiments revealed that OOD detection performance deteriorates drastically when the ID and OOD data are strikingly different from ImageNet, e.g. SVHN, medical data or even FashionMNIST.