My first experiment in confidence calibration and conformal prediction

Risk controlled FOD detection with calibrated semantic segmentation

machine-learning
conformal-prediction
confidence-calibration
risk-control
Author

Diogo Silva

Published

July 4, 2024

Abstract

Machine learning has been increasingly used in safety-critical applications such as Foreign Object Debris (FOD) detection. For the purposes of risk management and integration in decision chains, it is crucial to have reliable confidence estimates for a model’s predictions. We propose a system that combines confidence calibration and risk control through conformal prediction, for FOD detection using binary semantic segmentation. We evaluated the performance of a popular segmentation architecture, and studied the effect of applying risk control and confidence calibration (both location independent and dependent) on the model’s performance, confidence calibration, decision boundary and the ability to control the false negative rate. Results suggest that histogram binning can be effective for confidence calibration, that conformal prediction can be used to control the risk of the model’s predictions, and that both can be combined to improve confidence calibration and risk control. The location dependent calibration method failed to produce the desired results in this application.

Figure 1: An example of the application of risk control. The left image shows the original prediction, while the middle image shows the result after the application of conformal prediction. Red for false negatives, blue for false positives, white for true positives, and black for background. The right image represents the changes from the original prediction to the conformal prediction, where the orange shows the decrease in false negatives, and the purple the increase in false positives.

flowchart LR
  IN[FOD image] -->|inference|SEG([Semantic \n Segmentation])
  SEG --> OUTY["ŷ (prediction)"]
  SEG --> OUTP["p (confidence)"]
  SEG --> OUTZ["z (logits)"]
  OUTY --> CALIB([Calibrator])
  OUTP --> CALIB
  OUTZ --> CALIB
  CALIB --> COUTY["ŷ' \n can be different"]
  CALIB --> COUTP["q\n calibrated confidence"]
  COUTP --> CP([Risk control \n Conformal Prediction])
  COUTY --> CP
  CP --> OUT["Final\noutput"]
  
  DATA[(Dataset)] --> CS
  DATA-. training .-> SEG

  CS[Calibration set] --> CC([Calibration])
  CC-. creates .-> CALIB
  CS-. used in .-> CP

Figure 2: Overview of the steps for using both confidence calibration and conformal prediction for ensuring risk control in FOD semantic segmentation.

1 Introduction

Machine learning has been increasingly used in safety-critical applications, such as autonomous driving, medical imaging, and industrial automation. Foreign Object Debris (FOD) detection is one such safety-critical application: FOD can cause severe damage to aircraft and pose a threat to passengers’ safety, and the European Aviation Safety Agency’s roadmap for Artificial Intelligence identifies FOD detection as one of the relevant use cases for AI in aviation (EASA Artificial Intelligence Roadmap 2.0 Published - A Human-Centric Approach to AI in Aviation EASA 2023).

In these applications, it is crucial to have reliable confidence estimates for the model’s predictions. In practice, this means that the model’s predicted probabilities should accurately reflect the true probabilities of the events. A well-calibrated model is essential for decision-making systems, as it allows for the estimation of the model’s uncertainty and the quantification of the risk associated with the model’s predictions.

While many machine learning models output confidence scores, these scores are often poorly calibrated (Guo et al. 2017), i.e. they do not accurately reflect the true probability of the events. Poorly calibrated models can lead to overconfident predictions, which can be dangerous in safety-critical applications. While later work suggests that recent popular architectures (Minderer et al. 2021), and even specialized loss functions such as focal loss (Mukhoti et al. 2020a), are less prone to miscalibration, explicitly quantifying and addressing the calibration of predicted confidences, as well as uncertainty, remains important to control the risk in safety-critical applications.

Following (Almeida et al. 2023), we address systems that use low-cost sensors for image acquisition, targeting integration in vehicles already used in ground operations. A system that performs well under these conditions can be more easily integrated into the current infrastructure, and can be used to improve the safety of operations. We intend to follow the work started in (Almeida et al. 2023), address the problem of confidence calibration and risk control in FOD segmentation, and answer the following questions:

  • How calibrated are the predicted confidences of the segmentation models in these conditions?
  • What is the effect of confidence calibration on the performance of the models?
  • How can we control the risk of FOD segmentation?
  • How does confidence calibration intersect with risk control in this application?

In particular, as we will later show, we use conformal prediction for risk control, and a location independent and a location dependent calibration technique for confidence calibration. The overall architecture is shown in Figure 2.

In Section 2, we briefly present the problem of FOD detection (Section 2.1) and segmentation (Section 2.2), before moving on to the focus of this work: confidence calibration (Section 2.3) and risk control (Section 2.4). The research questions are addressed with the methodology presented in Section 3. Implementation details (Section 4) cover the dataset (Section 4.1), and the models and training procedures used in the experiments (Section 4.2). The choices for risk control and confidence calibration are detailed in Section 4.3. Results are organized per research question in Section 5, where a discussion of the results is also presented. Finally, we conclude with the main takeaways and suggestions for future work (Section 6).

3 Methodology

In this section, we present the methodology used to answer the questions posed in the introduction, with respect to FOD semantic segmentation:

  1. How calibrated are the predicted confidences of the segmentation models in these conditions?
  2. How can we control the risk of FOD segmentation?
  3. What is the effect of confidence calibration on the performance of the models?
  4. How does confidence calibration intersect with risk control in this application?

To answer these questions, we must start from a trained segmentation model, similar to the one presented by Almeida et al. (2023). To answer question 1, we will evaluate this model on a test set and measure the ECE, MCE, D-ECE and D-MCE of the model’s predicted confidences. We will further present reliability diagrams for the calibration error.
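For concreteness, these calibration metrics can be computed with a few lines of NumPy. The sketch below is illustrative rather than the exact implementation used in the experiments; it assumes per-pixel confidences and correctness indicators for the positive class, flattened over the test set, and 10 equal-width confidence bins.

```python
import numpy as np

def ece_mce(conf, correct, n_bins=10):
    """ECE/MCE sketch: bin per-pixel confidences into equal-width bins and
    aggregate the |accuracy - confidence| gap in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap   # ECE weights each bin by its share of samples
        mce = max(mce, gap)        # MCE keeps the worst-case bin gap
    return ece, mce
```

D-ECE and D-MCE follow the same recipe, with each confidence bin further subdivided by the (normalised) pixel position.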

To answer question 2, we will apply the conformal prediction framework to the segmentation model, using the risk control process presented in Section 2.3.2. We will then evaluate the model’s performance with and without risk control, and compare the results. The performance will be measured using the false negative rate, false discovery rate, mean intersection over union and pixel-wise accuracy. Because the risk control procedure will change the decision boundary of the model, we will also evaluate the calibration of the model after risk control.
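These segmentation and risk metrics are all derived from the pixel-wise confusion counts for the positive class. A minimal sketch, assuming binary NumPy masks for prediction and ground truth:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """FNR, FDR, IoU and pixel-wise accuracy for the positive (FOD) class.
    `pred` and `gt` are boolean arrays of the same shape; the ratios are
    undefined when there are no predicted or ground-truth positive pixels."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "FNR": fn / (fn + tp),              # the risk we want to control
        "FDR": fp / (fp + tp),
        "IoU": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```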

To answer question 3, we will apply a location independent and a location dependent calibration technique to the model. We will then evaluate the model’s performance with and without calibration, and compare the results.

Finally, question 4 is answered by combining the methodologies of questions 2 and 3, i.e. we will apply the risk control procedure from question 2 over the calibrated outputs from question 3, and report on the impact on the model’s performance.

4 Implementation details

4.1 Dataset

We used the dataset from (Almeida et al. 2023) to train the semantic segmentation models. In particular, we used only the visible light spectrum. Images are of size 1920 \times 1080 pixels, and we cropped the images with 10% overlap between crops to create a larger dataset. Because the dataset is small, and to increase generalization performance, augmentations were performed during training. Augmentations include random application of affine transformations (rotation, translation, resizing, shearing) and photometric distortions.
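As an illustration of the cropping step, the sketch below tiles an image with roughly 10% overlap between neighbouring crops; the crop size and edge handling are assumptions for the example, not the exact preprocessing used for the dataset.

```python
def crop_with_overlap(image, crop_size=512, overlap=0.1):
    """Yield square crops of an H x W x C image (assumed larger than crop_size)
    with ~10% overlap between neighbouring crops. The last row/column of crops
    is aligned to the image edge, so it may overlap a little more."""
    h, w = image.shape[:2]
    stride = int(crop_size * (1 - overlap))
    tops = list(range(0, h - crop_size + 1, stride))
    lefts = list(range(0, w - crop_size + 1, stride))
    if tops[-1] != h - crop_size:
        tops.append(h - crop_size)      # edge-aligned last row of crops
    if lefts[-1] != w - crop_size:
        lefts.append(w - crop_size)     # edge-aligned last column of crops
    for top in tops:
        for left in lefts:
            yield image[top:top + crop_size, left:left + crop_size]
```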

We used ratios of 80%, 10%, and 10% for the training, validation and testing splits, respectively. Since the dataset consists of image sequences, the splits were not performed randomly, so that the same object does not appear in more than one of the training, validation and test sets. However, the sequences are not identified, so making perfect splits without relabeling the data was not possible. Still, the natural order of the filenames corresponds to the order in which the images were taken, so we used this information to split the dataset. Because the dataset has images from two different sensors, the splits were performed in such a way that each set from each sensor had roughly the same sequences.

We note that most images in the sets contain an object. However, after cropping, and depending on the crop size, the number of images with objects can be significantly reduced, to the point of being a very small fraction of the dataset. We experimented with controlling the proportion of samples with objects in the sets.

We further note that objects are very small compared to the image size, which results in a highly imbalanced segmentation problem, as the vast majority of pixels are background.

4.2 Model choice and training procedure

We used a simple FCN (Long, Shelhamer, and Darrell 2015) with a ResNet-50 backbone (He et al. 2016), pre-trained on ImageNet (Deng et al. 2009). The loss is a weighted sum of binary cross-entropy and Dice loss, with weights of 0.8 and 0.2, respectively. Training was done with the Adam optimizer and a learning rate of 0.001. The models were trained for 200 epochs, and the best model according to mIoU on the validation set was picked.
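A minimal PyTorch sketch of this setup, using torchvision’s FCN-ResNet-50 with an ImageNet-pretrained backbone; the training loop, crop size and augmentation pipeline are omitted, and the hyperparameters follow the values stated above.

```python
import torch
import torchvision

# FCN with a ResNet-50 backbone; the backbone is initialised from ImageNet
# weights and the head outputs a single logit map for the FOD class.
model = torchvision.models.segmentation.fcn_resnet50(
    weights_backbone=torchvision.models.ResNet50_Weights.IMAGENET1K_V1,
    num_classes=1,
)

def dice_loss(logits, target, eps=1e-6):
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    return 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)

def loss_fn(logits, target):
    # Weighted sum of binary cross-entropy (0.8) and Dice loss (0.2).
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)
    return 0.8 * bce + 0.2 * dice_loss(logits, target)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# In the training loop the forward pass would look like:
#   logits = model(images)["out"]        # (N, 1, H, W)
#   loss = loss_fn(logits, masks.float())
```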

4.3 Confidence calibration and risk control

In the experiments, we chose to use only histogram binning and its location dependent variant, due to their ease of implementation. The calibration was performed on the validation set, and the calibration metrics were computed on the test set.
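The sketch below shows what the location independent variant might look like: each predicted confidence is replaced by the empirical frequency of positives in its confidence bin, estimated on the calibration (here, validation) set. The location dependent variant would additionally index the bins by the normalised pixel position. Class and variable names are illustrative, not the actual implementation.

```python
import numpy as np

class HistogramBinning:
    """Location independent histogram binning for binary segmentation."""
    def __init__(self, n_bins=10):
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)
        self.bin_freq = None

    def _bin_index(self, conf):
        return np.clip(np.digitize(conf, self.edges) - 1, 0, len(self.edges) - 2)

    def fit(self, conf, label):
        """conf: predicted positive-class confidences, label: 0/1 ground truth,
        both flattened over the calibration set."""
        idx = self._bin_index(conf)
        centres = (self.edges[:-1] + self.edges[1:]) / 2
        self.bin_freq = np.array([
            label[idx == b].mean() if np.any(idx == b) else centres[b]
            for b in range(len(centres))   # empty bins fall back to the bin centre
        ])
        return self

    def transform(self, conf):
        """Replace each confidence with its bin's empirical positive frequency."""
        return self.bin_freq[self._bin_index(conf)]
```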

To control the risk in the FOD detection system, the conformal prediction framework was used. Instead of modelling the error rate (as shown in Section 2.3.1), the focus was bounding the false negative rate (as mentioned in Section 2.3.2). Furthermore, within the scope of binary semantic segmentation, the prediction set \mathcal{C}_\lambda of Equation 3 is parametrized by \lambda in the following manner

\mathcal{C}_{\lambda} (X_\text{test}) = \{ (i,j) \in \mathbf{I} : f(X_\text{test}; \theta)_{(i,j)} \ge \lambda \}

where \mathbf{I} is the set of pixel coordinates and \lambda can be estimated as in Equation 5. Following Anastasios N. Angelopoulos, Bates, et al. (2023), we use SciPy’s (Virtanen et al. 2020) implementation of Brent’s method (Brent 1973) to find \lambda.
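A sketch of how this could look in code is given below: it evaluates the empirical per-image FNR on the calibration set as a function of \lambda and uses scipy.optimize.brentq to find the threshold where the conformally corrected risk meets the target \alpha. The exact correction term of Equation 5 is not reproduced here; the sketch uses the standard conformal risk control correction for a loss bounded by 1, so treat it as illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def empirical_fnr(lam, scores, masks):
    """Mean per-image FNR on the calibration set when pixels with
    confidence >= lam are predicted as FOD."""
    fnrs = []
    for s, m in zip(scores, masks):        # s: HxW confidences, m: HxW 0/1 mask
        positives = m.sum()
        if positives == 0:
            continue                       # FNR is undefined without positives
        fn = np.logical_and(s < lam, m == 1).sum()
        fnrs.append(fn / positives)
    return float(np.mean(fnrs))

def calibrate_lambda(scores, masks, alpha=0.1):
    """Find the threshold whose corrected calibration FNR meets alpha."""
    n = len(scores)
    def gap(lam):
        # Conformal risk control style correction: n/(n+1) * R_hat + 1/(n+1) <= alpha.
        return (n / (n + 1)) * empirical_fnr(lam, scores, masks) + 1 / (n + 1) - alpha
    # gap is negative for small lam (few false negatives) and positive near 1,
    # so Brent's method can bracket the crossing on [0, 1].
    return brentq(gap, 0.0, 1.0)
```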

5 Results and discussion

All results will be reported with respect to the positive class (predicted or in ground truth), i.e. the class of interest. This is because mIoU, FNR and FDR are not defined if no predicted or ground truth positive pixels exist. Although ECE and D-ECE can be computed for the negative class, because the vast majority of pixels are easy negatives, this would dramatically skew the reported calibration metrics if both classes were combined. Thus, calibration metrics are reported for the positive class only, for consistency. Table 1 shows some of the main results at a glance, including the performance of our baseline. For risk control, we set the maximum acceptable FNR to 0.1.

Table 1: Results for several metrics for the baseline and after the application of several techniques and their combinations. CP refers to variants with risk control through conformal prediction, IHB and DHB refer to location independent and dependent histogram binning, respectively. FDR is the false discovery rate.
Model λ mIoU [%] FNR [%] FDR [%] ECE [%] D-ECE [%]
Baseline 0.50 78.1 14.0 10.8 2.1 3.3
CP 0.12 78.1 10.1 14.6 2.8 4.5
IHB 0.50 78.3 13.0 11.5 1.0 4.0
DHB 0.50 71.4 20.8 11.7 2.5 5.6
IHB+CP 0.33 77.9 9.8 14.9 1.0 4.8
DHB+CP 0.17 72.2 18.5 13.1 3.0 6.3

5.1 Question 1 - Are confidences calibrated?

While a requirement for minimum confidence calibration was not set, the default calibration errors for this model are fairly low (see the first line of Table 1 and Figure 4 (a)). D-ECE is higher than ECE, which is to be expected, since the objects’ location in the image may play a role in the model’s confidence. It’s noteworthy that the vast majority of samples fall in the highest confidence bin.

(a) The bottom plot shows the reliability diagram, which represents the calibration error in each of the confidence bins. The top plot shows the relative number of samples in each bin.
(b) Reliability diagram and assigned samples per bin after the application of risk control with conformal prediction, without calibration.
Figure 4

5.2 Question 2 - What is the effect of risk control?

After applying risk control over the baseline model, the false negative rate was controlled, and its visual effect can be observed in Figure 1. The model’s calibration errors increased, as seen in line 2 of Table 1. This might have happened because the model’s decision boundary was changed to comply with the desired false negative rate (which it did), making the model more conservative in its predictions, so the calibration error for the positive class now includes more bins. The reliability diagram of Figure 4 (b) shows the model is now more calibrated for predictions close to a confidence of 0.5 or 1.0. Because the model is now more conservative, the false discovery rate increased, which is also to be expected, as decreasing the FNR by moving the decision boundary will necessarily increase the FDR. The mIoU, on the other hand, was maintained, which is a desired outcome, since the risk controlled model is still able to segment the objects with the same performance as the baseline.

5.3 Question 3 - What is the effect of confidence calibration?

After applying location independent histogram binning (line 3 of Table 1), ECE decreased, but D-ECE increased. This is concordant with the findings of Küppers et al. (2020), since this calibration method is invariant to location.

After applying location dependent histogram binning (line 4), mIoU decreased significantly, ECE was maintained, but D-ECE increased. This is not in line with existing literature (Küppers et al. 2020, 2022), and is not expected, since this calibration method specifically addresses the location invariance of standard histogram binning. However, it is somewhat in line with the findings of Kumar, Liang, and Ma (2019), which show that when the number of bins increases significantly, so does the calibration error. These findings may explain the negative effect on D-ECE, since the number of location bins was 20 for each axis, which, combined with the 10 confidence bins, resulted in 4000 bins in total, contrasting with the 10 bins of standard histogram binning.

It’s also noteworthy that the number of samples in both the calibration and test sets is rather low. The distribution of the number of samples across confidence bins doesn’t give a clear picture of the distribution over all bins, since each of those confidence bins is further divided into 400 location bins (which are used to compute D-ECE). To illustrate this, consider the baseline model, which has 10 (2.5%) empty location bins (not subdivided into confidence bins), but 2216 (55.4%) empty bins (location and confidence).
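To make the bin bookkeeping concrete, the sketch below computes a D-ECE style metric by jointly binning per-pixel confidences and normalised pixel coordinates; with 10 confidence bins and 20 bins per spatial axis this produces the 4000 bins discussed above. It is a simplified illustration of the detection-aware calibration error of Küppers et al. (2020), not the exact implementation used for the reported numbers.

```python
import numpy as np

def detection_ece(conf, correct, x_norm, y_norm, n_conf_bins=10, n_loc_bins=20):
    """D-ECE sketch: weight the |accuracy - confidence| gap of each
    (confidence, x, y) bin by its share of samples; empty bins contribute nothing."""
    c = np.clip((conf * n_conf_bins).astype(int), 0, n_conf_bins - 1)
    x = np.clip((x_norm * n_loc_bins).astype(int), 0, n_loc_bins - 1)
    y = np.clip((y_norm * n_loc_bins).astype(int), 0, n_loc_bins - 1)
    flat = (c * n_loc_bins + x) * n_loc_bins + y   # unique index per 3D bin
    d_ece, n = 0.0, len(conf)
    for b in np.unique(flat):
        mask = flat == b
        d_ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return d_ece
```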

Observing the reliability diagram of Figure 5 (a), we see that the model no longer reports error in some of the bins that, for the same decision boundary, held samples in the baseline of Figure 4 (a). This is a result of the application of histogram binning, which can change the ranking of the predictions; in this case the calibrated confidences are a mere shift of the original distribution. The “missing” samples might have been moved to the bins to the left (since there is a slight increase in the number of samples in the lower confidence bins), or have been moved across the decision boundary to the background class.

Surprisingly, this does not occur after the application of location dependent histogram binning, as seen in Figure 5 (b). The number of samples in those confidence bins is actually greater than in the baseline, but the error in those bins is also significantly higher. Still, ECE is not worse, and while this may seem strange, consider that ECE is the weighted average of the calibration error in each bin, and the bin with the vast majority of samples is well calibrated.

It should also be noted that although the overall calibration error is lower, the error consistently tends towards overconfident predictions - a trend not observed in the baseline model. This happens after the application of both calibration methods, and is more pronounced for the location dependent one.

(a) Reliability diagram and assigned samples per bin after the application of location independent histogram binning without risk control. Decision boundary for \lambda = 0.5.