My first experiment in confidence calibration and conformal prediction

Risk controlled FOD detection with calibrated semantic segmentation

machine-learning
conformal-prediction
confidence-calibration
risk-control
Author

Diogo Silva

Published

July 4, 2024

Abstract

Machine learning has been increasingly used in safety-critical applications such as Foreign Object Debris (FOD) detection. For the purposes of risk management and integration in decision chains, it is crucial to have reliable confidence estimates for a model’s predictions. We propose a system that combines confidence calibration and risk control through conformal prediction, for FOD detection using binary semantic segmentation. We evaluated the performance of a popular segmentation architecture, and studied the effect of applying risk control and confidence calibration (both location independent and dependent) on the model’s performance, confidence calibration, decision boundary and the ability to control the false negative rate. Results suggest that histogram binning can be effective for confidence calibration, that conformal prediction can be used to control the risk of the model’s predictions, and that both can be combined to improve confidence calibration and risk control. The location dependent calibration method failed to produce the desired results in this application.

Figure 1: An example of the application of risk control. The left image shows the original prediction, while the middle image shows the result after the application of conformal prediction. Red for false negatives, blue for false positives, white for true positives, and black for background. The right image represents the changes from the original prediction to the conformal prediction, where the orange shows the decrease in false negatives, and the purple the increase in false positives.

flowchart LR
  IN[FOD image] -->|inference|SEG([Semantic \n Segmentation])
  SEG --> OUTY["ŷ (prediction)"]
  SEG --> OUTP["p (confidence)"]
  SEG --> OUTZ["z (logits)"]
  OUTY --> CALIB([Calibrator])
  OUTP --> CALIB
  OUTZ --> CALIB
  CALIB --> COUTY["ŷ' \n can be different"]
  CALIB --> COUTP["q\n calibrated confidence"]
  COUTP --> CP([Risk control \n Conformal Prediction])
  COUTY --> CP
  CP --> OUT["Final\noutput"]
  
  DATA[(Dataset)] --> CS
  DATA-. training .-> SEG

  CS[Calibration set] --> CC([Calibration])
  CC-. creates .-> CALIB
  CS-. used in .-> CP

Figure 2: Overview of the steps for using both confidence calibration and conformal prediction for ensuring risk control in FOD semantic segmentation.

1 Introduction

Machine learning has been increasingly used in safety-critical applications, such as autonomous driving, medical imaging, and industrial automation. Foreign Object Debris (FOD) detection is one such safety-critical application: FOD can cause severe damage to aircraft and pose a threat to passengers’ safety, and the European Aviation Safety Agency’s roadmap for Artificial Intelligence identifies FOD detection as one of the relevant use cases for AI in aviation (EASA Artificial Intelligence Roadmap 2.0 Published - A Human-Centric Approach to AI in Aviation EASA 2023).

In these applications, it is crucial to have reliable confidence estimates for the model’s predictions. In practice, this means that the model’s predicted probabilities should accurately reflect the true probabilities of the events. A well-calibrated model is essential for decision-making systems, as it allows for the estimation of the model’s uncertainty and the quantification of the risk associated with the model’s predictions.

While many machine learning models output confidence scores, these scores are often poorly calibrated (Guo et al. 2017), i.e. they do not accurately reflect the true probability of the events. Poorly calibrated models can lead to overconfident predictions, which can be dangerous in safety-critical applications. While later work suggests that recent popular architectures (Minderer et al. 2021), and even specialized loss functions such as focal loss (Mukhoti et al. 2020a), are less prone to miscalibration, explicitly quantifying and addressing the calibration of predicted confidences, as well as uncertainty, remains important to control the risk in safety-critical applications.

Following (Almeida et al. 2023), we address systems that use low-cost sensors for image acquisition, targeting integration in vehicles already used in ground operations. A system that performs well under these conditions can be more easily integrated into the current infrastructure, and can be used to improve the safety of operations. We intend to follow the work started in (Almeida et al. 2023), address the problem of confidence calibration and risk control in FOD segmentation, and answer the following questions:

  • How calibrated are the predicted confidences of the segmentation models in these conditions?
  • What is the effect of confidence calibration on the performance of the models?
  • How can we control the risk of FOD segmentation?
  • How does confidence calibration intersect with risk control in this application?

In particular, as we will later show, we use conformal prediction for risk control, and a location independent and a location dependent calibration technique for confidence calibration. The overall architecture is shown in Figure 2.

In Section 2, we briefly present the problem of FOD detection (Section 2.1) and segmentation (Section 2.2), before moving on to the focus of this work: confidence calibration (Section 2.3) and risk control (Section 2.4). The research questions are addressed with the methodology presented in Section 3. Implementation details (Section 4) cover the dataset (Section 4.1), and the models and training procedures used in the experiments (Section 4.2). The choices for risk control and confidence calibration are detailed in Section 4.3. Results are organized per research question in Section 5, where a discussion of the results is also presented. Finally, we conclude with the main takeaways and suggestions for future work (Section 6).

3 Methodology

In this section, we present the methodology used to answer the questions posed in the introduction, with respect to FOD semantic segmentation:

  1. How calibrated are the predicted confidences of the segmentation models in these conditions?
  2. How can we control the risk of FOD segmentation?
  3. What is the effect of confidence calibration on the performance of the models?
  4. How does confidence calibration intersect with risk control in this application?

To answer these questions, we must start from a trained segmentation model, similar to the one presented by Almeida et al. (2023). To answer question 1, we will evaluate this model on a test set and measure the ECE, MCE, D-ECE and D-MCE of the model’s predicted confidences. We will further present reliability diagrams for the calibration error.
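For concreteness, these calibration metrics can be computed with a few lines of NumPy. The sketch below is illustrative rather than the exact implementation used in the experiments; it assumes per-pixel confidences and correctness indicators for the positive class, flattened over the test set, and 10 equal-width confidence bins.

```python
import numpy as np

def ece_mce(conf, correct, n_bins=10):
    """ECE/MCE sketch: bin per-pixel confidences into equal-width bins and
    aggregate the |accuracy - confidence| gap in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap   # ECE weights each bin by its share of samples
        mce = max(mce, gap)        # MCE keeps the worst-case bin gap
    return ece, mce
```

D-ECE and D-MCE follow the same recipe, with each confidence bin further subdivided by the (normalised) pixel position.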

To answer question 2, we will apply the conformal prediction framework to the segmentation model, using the risk control process presented in Section 2.3.2. We will then evaluate the model’s performance with and without risk control, and compare the results. The performance will be measured using the false negative rate, false discovery rate, mean intersection over union and pixel-wise accuracy. Because the risk control procedure will change the decision boundary of the model, we will also evaluate the calibration of the model after risk control.
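These segmentation and risk metrics are all derived from the pixel-wise confusion counts for the positive class. A minimal sketch, assuming binary NumPy masks for prediction and ground truth:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """FNR, FDR, IoU and pixel-wise accuracy for the positive (FOD) class.
    `pred` and `gt` are boolean arrays of the same shape; the ratios are
    undefined when there are no predicted or ground-truth positive pixels."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "FNR": fn / (fn + tp),              # the risk we want to control
        "FDR": fp / (fp + tp),
        "IoU": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```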

To answer question 3, we will apply a location independent and a location dependent calibration technique to the model. We will then evaluate the model’s performance with and without calibration, and compare the results.

Finally, question 4 is answered by combining the methodologies of questions 2 and 3, i.e. we will apply the risk control procedure from question 2 over the calibrated outputs from question 3, and report on the impact on the model’s performance.

4 Implementation details

4.1 Dataset

We used the dataset from (Almeida et al. 2023) to train the semantic segmentation models. In particular, we used only the visible light spectrum. Images are of size 1920 \times 1080 pixels, and we cropped the images with 10% overlap between crops to create a larger dataset. Because the dataset is small, and to increase generalization performance, augmentations were performed during training. Augmentations include random application of affine transformations (rotation, translation, resizing, shearing) and photometric distortions.
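As an illustration of the cropping step, the sketch below tiles an image with roughly 10% overlap between neighbouring crops; the crop size and edge handling are assumptions for the example, not the exact preprocessing used for the dataset.

```python
def crop_with_overlap(image, crop_size=512, overlap=0.1):
    """Yield square crops of an H x W x C image (assumed larger than crop_size)
    with ~10% overlap between neighbouring crops. The last row/column of crops
    is aligned to the image edge, so it may overlap a little more."""
    h, w = image.shape[:2]
    stride = int(crop_size * (1 - overlap))
    tops = list(range(0, h - crop_size + 1, stride))
    lefts = list(range(0, w - crop_size + 1, stride))
    if tops[-1] != h - crop_size:
        tops.append(h - crop_size)      # edge-aligned last row of crops
    if lefts[-1] != w - crop_size:
        lefts.append(w - crop_size)     # edge-aligned last column of crops
    for top in tops:
        for left in lefts:
            yield image[top:top + crop_size, left:left + crop_size]
```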

We used ratios of 80%, 10%, and 10% for the training, validation and testing splits, respectively. Since the dataset consists of image sequences, the splits were not performed randomly, so that the same object does not appear in more than one of the training, validation and test sets. However, the sequences are not identified, so making perfect splits without relabeling the data was not possible. Still, the natural order of the filenames corresponds to the order in which the images were taken, so we used this information to split the dataset. Because the dataset has images from two different sensors, the splits were performed in such a way that each set from each sensor had roughly the same sequences.

We note that most images in the sets contain an object. However, after cropping, and depending on the crop size, the number of images with objects can be significantly reduced, to the point of being a very small fraction of the dataset. We experimented with controlling the proportion of samples with objects in the sets.

We further note that objects are very small compared to the image size, which results in a highly imbalanced segmentation problem, as the vast majority of pixels are background.

4.2 Model choice and training procedure

We used a simple FCN (Long, Shelhamer, and Darrell 2015) with a ResNet-50 backbone (He et al. 2016), pre-trained on ImageNet (Deng et al. 2009). The loss is a weighted sum of binary cross-entropy and Dice loss, with weights of 0.8 and 0.2, respectively. Training was done with the Adam optimizer and a learning rate of 0.001. The models were trained for 200 epochs, and the best model according to mIoU on the validation set was picked.
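A minimal PyTorch sketch of this setup, using torchvision’s FCN-ResNet-50 with an ImageNet-pretrained backbone; the training loop, crop size and augmentation pipeline are omitted, and the hyperparameters follow the values stated above.

```python
import torch
import torchvision

# FCN with a ResNet-50 backbone; the backbone is initialised from ImageNet
# weights and the head outputs a single logit map for the FOD class.
model = torchvision.models.segmentation.fcn_resnet50(
    weights_backbone=torchvision.models.ResNet50_Weights.IMAGENET1K_V1,
    num_classes=1,
)

def dice_loss(logits, target, eps=1e-6):
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    return 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)

def loss_fn(logits, target):
    # Weighted sum of binary cross-entropy (0.8) and Dice loss (0.2).
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)
    return 0.8 * bce + 0.2 * dice_loss(logits, target)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# In the training loop the forward pass would look like:
#   logits = model(images)["out"]        # (N, 1, H, W)
#   loss = loss_fn(logits, masks.float())
```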

4.3 Confidence calibration and risk control

In the experiments, we chose to use only histogram binning and its location dependent variant, due to their ease of implementation. The calibration was performed on the validation set, and the calibration metrics were computed on the test set.
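The sketch below shows what the location independent variant might look like: each predicted confidence is replaced by the empirical frequency of positives in its confidence bin, estimated on the calibration (here, validation) set. The location dependent variant would additionally index the bins by the normalised pixel position. Class and variable names are illustrative, not the actual implementation.

```python
import numpy as np

class HistogramBinning:
    """Location independent histogram binning for binary segmentation."""
    def __init__(self, n_bins=10):
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)
        self.bin_freq = None

    def _bin_index(self, conf):
        return np.clip(np.digitize(conf, self.edges) - 1, 0, len(self.edges) - 2)

    def fit(self, conf, label):
        """conf: predicted positive-class confidences, label: 0/1 ground truth,
        both flattened over the calibration set."""
        idx = self._bin_index(conf)
        centres = (self.edges[:-1] + self.edges[1:]) / 2
        self.bin_freq = np.array([
            label[idx == b].mean() if np.any(idx == b) else centres[b]
            for b in range(len(centres))   # empty bins fall back to the bin centre
        ])
        return self

    def transform(self, conf):
        """Replace each confidence with its bin's empirical positive frequency."""
        return self.bin_freq[self._bin_index(conf)]
```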

To control the risk in the FOD detection system, the conformal prediction framework was used. Instead of modelling the error rate (as shown in Section 2.3.1), the focus was bounding the false negative rate (as mentioned in Section 2.3.2). Furthermore, within the scope of binary semantic segmentation, the prediction set \mathcal{C}_\lambda of Equation 3 is parametrized by \lambda in the following manner

\mathcal{C}_{\lambda} (X_\text{test}) = \{ (i,j) \in \mathbf{I} : f(X_\text{test}; \theta)_{(i,j)} \ge \lambda \}

where \mathbf{I} is the set of pixel coordinates and \lambda can be estimated as in Equation 5. Following Anastasios N. Angelopoulos, Bates, et al. (2023), we use SciPy’s (Virtanen et al. 2020) implementation of Brent’s method (Brent 1973) to find \lambda.
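A sketch of how this could look in code is given below: it evaluates the empirical per-image FNR on the calibration set as a function of \lambda and uses scipy.optimize.brentq to find the threshold where the conformally corrected risk meets the target \alpha. The exact correction term of Equation 5 is not reproduced here; the sketch uses the standard conformal risk control correction for a loss bounded by 1, so treat it as illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def empirical_fnr(lam, scores, masks):
    """Mean per-image FNR on the calibration set when pixels with
    confidence >= lam are predicted as FOD."""
    fnrs = []
    for s, m in zip(scores, masks):        # s: HxW confidences, m: HxW 0/1 mask
        positives = m.sum()
        if positives == 0:
            continue                       # FNR is undefined without positives
        fn = np.logical_and(s < lam, m == 1).sum()
        fnrs.append(fn / positives)
    return float(np.mean(fnrs))

def calibrate_lambda(scores, masks, alpha=0.1):
    """Find the threshold whose corrected calibration FNR meets alpha."""
    n = len(scores)
    def gap(lam):
        # Conformal risk control style correction: n/(n+1) * R_hat + 1/(n+1) <= alpha.
        return (n / (n + 1)) * empirical_fnr(lam, scores, masks) + 1 / (n + 1) - alpha
    # gap is negative for small lam (few false negatives) and positive near 1,
    # so Brent's method can bracket the crossing on [0, 1].
    return brentq(gap, 0.0, 1.0)
```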

5 Results and discussion

All results will be reported with respect to the positive class (predicted or in ground truth), i.e. the class of interest. This is because mIoU, FNR and FDR are not defined if no predicted or ground truth positive pixels exist. Although ECE and D-ECE can be computed for the negative class, because the vast majority of pixels are easy negatives, this would dramatically skew the reported calibration metrics if both classes were combined. Thus, calibration metrics are reported for the positive class only, for consistency. Table 1 shows some of the main results at a glance, including the performance of our baseline. For risk control, we set the maximum acceptable FNR to 0.1.

Table 1: Results for several metrics for the baseline and after the application of several techniques and their combinations. CP refers to variants with risk control through conformal prediction, IHB and DHB refer to location independent and dependent histogram binning, respectively. FDR is the false discovery rate.
Model λ mIoU [%] FNR [%] FDR [%] ECE [%] D-ECE [%]
Baseline 0.50 78.1 14.0 10.8 2.1 3.3
CP 0.12 78.1 10.1 14.6 2.8 4.5
IHB 0.50 78.3 13.0 11.5 1.0 4.0
DHB 0.50 71.4 20.8 11.7 2.5 5.6
IHB+CP 0.33 77.9 9.8 14.9 1.0 4.8
DHB+CP 0.17 72.2 18.5 13.1 3.0 6.3

5.1 Question 1 - Are confidences calibrated?

While a requirement for minimum confidence calibration was not set, the default calibration errors for this model are fairly low (see the first line of Table 1 and Figure 4 (a)). D-ECE is higher than ECE, which is to be expected, since the objects’ location in the image may play a role in the model’s confidence. It’s noteworthy that the vast majority of samples fall in the highest confidence bin.

(a) The bottom plot shows the reliability diagram, which represents the calibration error in each of the confidence bins. The top plot shows the relative number of samples in each bin.
(b) Reliability diagram and assigned samples per bin after the application of risk control with conformal prediction, without calibration.
Figure 4

5.2 Question 2 - What is the effect of risk control?

After applying risk control over the baseline model, the false negative rate was controlled, and its visual effect can be observed in Figure 1. The model’s calibration errors increased, as seen in line 2 of Table 1. This might have happened because the model’s decision boundary was changed to comply with the desired false negative rate (which it did), making the model more conservative in its predictions, so the calibration error for the positive class now includes more bins. The reliability diagram of Figure 4 (b) shows the model is now more calibrated for predictions close to a confidence of 0.5 or 1.0. Because the model is now more conservative, the false discovery rate increased, which is also to be expected, as decreasing the FNR by moving the decision boundary will necessarily increase the FDR. The mIoU, on the other hand, was maintained, which is a desired outcome, since the risk controlled model is still able to segment the objects with the same performance as the baseline.

5.3 Question 3 - What is the effect of confidence calibration?

After applying location independent histogram binning (line 3 of Table 1), ECE decreased, but D-ECE increased. This is concordant with the findings of Küppers et al. (2020), since this calibration method is invariant to location.

After applying location dependent histogram binning (line 4), mIoU decreased significantly, ECE was maintained, but D-ECE increased. This is not in line with existing literature (Küppers et al. 2020, 2022), and is not expected, since this calibration method specifically addresses the location invariance of standard histogram binning. However, it is somewhat in line with the findings of Kumar, Liang, and Ma (2019), which show that when the number of bins increases significantly, so does the calibration error. These findings may explain the negative effect on D-ECE, since the number of location bins was 20 for each axis, which, combined with the 10 confidence bins, resulted in 4000 bins in total, contrasting with the 10 bins of standard histogram binning.

It’s also noteworthy that the number of samples in both the calibration and test sets is rather low. The distribution of the number of samples across confidence bins doesn’t give a clear picture of the distribution over all bins, since each of those confidence bins is further divided into 400 location bins (which are used to compute D-ECE). To illustrate this, consider the baseline model, which has 10 (2.5%) empty location bins (not subdivided into confidence bins), but 2216 (55.4%) empty bins (location and confidence).
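To make the bin bookkeeping concrete, the sketch below computes a D-ECE style metric by jointly binning per-pixel confidences and normalised pixel coordinates; with 10 confidence bins and 20 bins per spatial axis this produces the 4000 bins discussed above. It is a simplified illustration of the detection-aware calibration error of Küppers et al. (2020), not the exact implementation used for the reported numbers.

```python
import numpy as np

def detection_ece(conf, correct, x_norm, y_norm, n_conf_bins=10, n_loc_bins=20):
    """D-ECE sketch: weight the |accuracy - confidence| gap of each
    (confidence, x, y) bin by its share of samples; empty bins contribute nothing."""
    c = np.clip((conf * n_conf_bins).astype(int), 0, n_conf_bins - 1)
    x = np.clip((x_norm * n_loc_bins).astype(int), 0, n_loc_bins - 1)
    y = np.clip((y_norm * n_loc_bins).astype(int), 0, n_loc_bins - 1)
    flat = (c * n_loc_bins + x) * n_loc_bins + y   # unique index per 3D bin
    d_ece, n = 0.0, len(conf)
    for b in np.unique(flat):
        mask = flat == b
        d_ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return d_ece
```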

Observing the reliability diagram of Figure 5 (a), we see that the model no longer reports error in some of the bins that, for the same decision boundary, held samples in the baseline of Figure 4 (a). This is a result of the application of histogram binning, which can change the ranking of the predictions; in this case the calibrated confidences are a mere shift of the original distribution. The “missing” samples might have been moved to the bins to the left (since there is a slight increase in the number of samples in the lower confidence bins), or have been moved across the decision boundary to the background class.

Surprisingly, this does not occur after the application of location dependent histogram binning, as seen in Figure 5 (b). The number of samples in those confidence bins is actually greater than in the baseline, but the error in those bins is also significantly higher. Still, ECE is not worse, and while this may seem strange, consider that ECE is the weighted average of the calibration error in each bin, and the bin with the vast majority of samples is well calibrated.

It should also be noted that although the overall calibration error is lower, the error consistently tends towards overconfident predictions - a trend not observed in the baseline model. This happens after the application of both calibration methods, and is more pronounced for the location dependent one.

(a) Reliability diagram and assigned samples per bin after the application of location independent histogram binning without risk control. Decision boundary for \lambda = 0.5.