Abstract

Deep neural networks for medical image segmentation often produce overconfident results misaligned with empirical observations. Such miscalibration challenges their clinical translation. We propose to use the marginal L1 average calibration error (mL1-ACE) as a novel auxiliary loss function to improve pixel-wise calibration without compromising segmentation quality. We show that this loss, despite using hard binning, is directly differentiable, bypassing the need for approximate but differentiable surrogate or soft binning approaches. Our work also introduces the concept of dataset reliability histograms, which generalise standard reliability diagrams for refined visual assessment of calibration in semantic segmentation aggregated at the dataset level. Using mL1-ACE, we reduce average and maximum calibration error by 45% and 55%, respectively, while maintaining a Dice score of 87% on the BraTS 2021 dataset. We share our code here: https://github.com/cai4cai/ACE-DLIRIS.
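A minimal sketch of how an mL1-ACE-style auxiliary term with hard, equal-width binning could look in PyTorch (an illustrative reconstruction from the description above, not the reference implementation; the function name ml1_ace_loss and the choice of 20 bins are our own assumptions, and the linked repository should be consulted for the actual code):

import torch

def ml1_ace_loss(probs: torch.Tensor, target: torch.Tensor, n_bins: int = 20) -> torch.Tensor:
    """Marginal L1 average calibration error over hard, equal-width bins.

    probs:  (N, C, ...) softmax probabilities.
    target: (N, C, ...) one-hot ground truth.
    Bin membership is an integer index (non-differentiable), but the
    within-bin mean confidence is a differentiable function of probs,
    so gradients flow almost everywhere.
    """
    n_classes = probs.shape[1]
    edges = torch.linspace(0.0, 1.0, n_bins + 1, device=probs.device)
    per_class_ace = []
    for c in range(n_classes):
        p = probs[:, c].reshape(-1)           # per-voxel probability for class c
        y = target[:, c].reshape(-1).float()  # per-voxel binary ground truth
        bin_idx = torch.bucketize(p.detach(), edges[1:-1])  # hard bin assignment
        gaps = []
        for b in range(n_bins):
            mask = bin_idx == b
            if mask.any():
                conf = p[mask].mean()  # mean predicted probability in the bin
                acc = y[mask].mean()   # observed class frequency in the bin
                gaps.append((conf - acc).abs())
        per_class_ace.append(torch.stack(gaps).mean())  # average over non-empty bins
    return torch.stack(per_class_ace).mean()            # average over classes

In training, such a term would be added to a segmentation loss with a weighting factor, e.g. loss = dice_loss + w * ml1_ace_loss(probs, onehot_target).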

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3075_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3075_supp.pdf

Link to the Code Repository

https://github.com/cai4cai/ACE-DLIRIS

Link to the Dataset(s)

https://www.med.upenn.edu/cbica/brats2021/

BibTex

@InProceedings{Bar_Average_MICCAI2024,
        author = { Barfoot, Theodore and Garcia Peraza Herrera, Luis C. and Glocker, Ben and Vercauteren, Tom},
        title = { { Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the issue of inappropriate model calibration in deep learning for medical image segmentation. Calibration issues in medical imaging are particularly important because overconfidence could lead to significant risks to patients. They present a novel loss function called marginal L1 Average Calibration Error (mL1-ACE) to address the overconfidence induced by standard segmentation loss functions such as the Dice loss. mL1-ACE offers two advantages: (1) it is differentiable despite the hard binning of probabilities, and (2) it provides class-wise calibration to address class imbalance. It performs favorably on BraTS 2021 when paired with the Dice loss. The second contribution of the paper is new dataset reliability histograms that can serve as an additional visual tool for assessing calibration in large image segmentation datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The mL1-ACE loss can be added to any loss function to improve calibration. It is differentiable and works with multi-class data. The authors test it with multiple loss function configurations to show its adaptability.
    2. The dataset reliability histograms are an interesting approach that provides visual information not present in other calibration papers. The differences in calibration between the losses are apparent in these figures. Public code for these graphs, which the authors promise to provide in the camera-ready version, would be helpful.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The mL1-ACE loss does not seem to be that different from the ACE equation given in Reference 20: Neumann, L., Zisserman, A., Vedaldi, A.: Relaxed softmax: efficient confidence auto-calibration for safe pedestrian detection. In: 2018 NIPS MLITS Workshop: Machine Learning for Intelligent Transportation System. OpenReview (2018). The only real difference is that the metric has been summed over all classes and converted into a loss function. This makes the loss function feel less novel in its contribution compared to that prior work.
    2. The writing issues in the paper severely impact its clarity (see critiques for examples).
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    1. They state that they will provide the code in the camera-ready version.
    2. They provide a good description of training parameters, computational requirements, and data splits.
    3. The dataset is the publicly available BraTS dataset.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The overall paper needs to have its writing substantially improved. There are commas seemingly placed at random in sentences that do not need them. This harms the paper’s readability. Examples: the sentence in the abstract that says “Such miscalibration, challenges their clinical translation” does not need a comma. The second sentence in the introduction also does not need the comma after the word “capacity”. The sentence “The Dice Similarity Coefficient (DSC) loss, is a popular…” does not need the comma after “loss”. The last paragraph in Section 2 is very hard to read due to the writing quality.
    2. The authors could define what the sub-equations of Equation 2 mean so that someone less familiar with calibration could follow.
    3. Does writing “statistically significant difference (p < 0.01)” in Fig. 3 mean that all the subplots were significant? The text seems to follow this format, but it is not clear within the figure itself.
    4. It seems that the best calibration results for the DSC-CE methods in Table 2 were not all highlighted. They should be highlighted for consistency with the other methods in this table.
    5. I can see that Figure 4 shows that the false positive and false negative regions are less confident, but it also looks like the false regions are larger overall. Is this true?
    6. Minor: the text refers to the appendix instead of the supplementary material.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think calibration is an important topic that needs more focus in the deep learning community. The results in this paper are good and the promised code repository would be helpful for researchers who are interested in this area. However, I question the novelty of the loss function. Also, the paper writing needs to be polished. Writing improvements may help make some of the contributions clearer.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I have increased my score after reading the author rebuttal. My main issues with this paper are novelty concerns and writing concerns; however, I don’t want to limit an interesting calibration paper due to writing issues. I am glad that the authors will improve the writing for the final version. I mostly changed my review because of the authors’ argument for novelty. Their argument that they are the first to use ACE as a loss, and the advantages offered by the segmentation setting, makes sense to me (as someone who has also extensively compared calibration in classification and segmentation). I also agreed with the other reviewer that explaining the differentiability of the loss in more detail would be helpful. I am glad that this will be included.



Review #2

  • Please describe the contribution of the paper

    The paper introduces an auxiliary loss for improving pixel-wise calibration, which reduces maximum and average calibration errors by ~50% on BraTS 2021 segmentation without compromising task performance when trained with the DSC loss; the improvement in calibration is limited when trained with the CE loss. The method introduced is a directly differentiable difference between confidence and accuracy (DCA) applicable to multi-class segmentation. The paper also extends the concept of reliability diagrams to dataset reliability histograms.
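    A rough sketch of how per-image reliability diagrams could be aggregated into a dataset-level histogram (our own illustration of the general idea, with a hypothetical function name dataset_reliability_histogram and equal-width bins assumed; the paper and repository define the authors’ actual construction):

    import numpy as np

    def dataset_reliability_histogram(per_image_probs, per_image_labels,
                                      n_bins=20, n_acc_bins=20):
        """For each confidence bin (columns), count how many images exhibit a
        given observed accuracy (rows) within that bin."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        hist = np.zeros((n_acc_bins, n_bins), dtype=int)
        for p, y in zip(per_image_probs, per_image_labels):
            bin_idx = np.digitize(p, edges[1:-1])        # hard bin index per voxel
            for b in range(n_bins):
                mask = bin_idx == b
                if mask.any():
                    acc = y[mask].mean()                 # per-image accuracy in this bin
                    row = min(int(acc * n_acc_bins), n_acc_bins - 1)
                    hist[row, b] += 1
        return hist  # e.g. visualise with matplotlib's imshow(origin="lower")

    Each column of the resulting matrix is then the distribution, across the dataset, of per-image accuracies for one confidence bin.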

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Well-formulated problem, motivation, and related work on calibration in segmentation. The improvement in average calibration performance is significant compared to the loss without the proposed auxiliary term, but limited to models trained with the Dice similarity coefficient, which has been shown to drive models to be overconfident (and hence miscalibrated). However, calibration for models trained with cross-entropy remains relatively constant.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors mention that equations 1, 3, and 4 are differentiable almost everywhere; how many bins were selected for these equations? How do the authors ensure that the function is indeed differentiable everywhere? There’s a discontinuity at the step between each bin; how is equation 3 differentiable at that junction?
    • Statistical significance testing with the t-test has been shown to be of limited value for comparing model performance, as the t-test assumes the data to be independent. It is more relevant to use Bayesian comparison testing for significance; see reference [1].

    Reference

    1. Benavoli, A., Corani, G., Demšar, J., Zaffalon, M.: Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis. JMLR (2017)
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Question:

    • This question is slightly out of scope, but I would like to know if the authors have insight into how the proposed loss would compare to the performance of label smoothing. Would there be a benefit of an auxiliary loss calibration penalty vs. label smoothing?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well formulated and written; however, a crucial aspect regarding the differentiability of the proposed auxiliary loss needs clarification, as the authors did not provide a mathematical proof or explanation of how the binning function would be differentiable at the bin junctions without the addition of a smoothing function or approximations.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have sufficiently addressed the reviewers’ comments. While the method presented primarily benefits calibration when the DSC is used as a loss function, it still holds significant value as exploratory research. In the medical imaging community, DSC is frequently employed in segmentation losses despite its tendency to produce overconfident predictions. This work, therefore, offers a valuable approach to improving model calibration without compromising model performance.



Review #3

  • Please describe the contribution of the paper

    This paper proposes three auxiliary loss terms for training-time calibration in image segmentation. The method was evaluated on BraTS brain tumour segmentation. The authors showed that calibration improved while segmentation performance was largely maintained.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well written, concise, and clear. It was a pleasure to read.
    • The proposed loss terms are simple, yet well motivated and effective.
    • The experimental setup is sound and allows a meaningful assessment of the paper’s contribution. In addition, the data and experiment setup were described in sufficient detail.
    • The authors propose an original and useful way to aggregate image-level reliability diagrams into dataset-level reliability diagrams.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors compare only to the post-hoc Temperature Scaling calibration method, despite an extensive discussion of related work on train-time calibration. The authors explain that some methods (e.g. DECE [3]) may be difficult to reproduce. However, could the authors justify why they did not compare against at least one training-time calibration method?

    • Fig. 3 is perhaps slightly misleading, as only the results for the DSC loss are shown: the only configuration where adding the calibration term did not reduce performance, but not the highest-performing model overall. Could the authors perhaps add the other results as well? On the other hand, does Fig. 3 actually add relevant information beyond Tables 1 and 2?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I appreciate that code publication was promised. In addition, I believe the paper already offers a lot of detail that should help reproduce the method and results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minor:

    • I believe the reference to [5] for “extensive use of reliability diagrams for classification” is slightly misplaced.

    • The design of Fig. 3 could be improved (make tick labels legible and markers thicker; consider removing the grid).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a strong paper. I would have appreciated comparison to perhaps another train-time calibration method, but the experimental evaluation is sound otherwise. I would appreciate if the authors could comment on my concerns, but I anyway recommend acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I believe the authors addressed both my and the other reviewers’ comments well, in particular the concerns raised by R5. Therefore, my recommendation to accept the paper remains unchanged.




Author Feedback

We thank the reviewers for their insightful comments and constructive feedback.

NOVELTY (R5) While the concept of average calibration error (ACE) is not new, our key contribution lies in adapting it as a directly differentiable auxiliary loss for semantic segmentation, an approach not addressed in prior literature. This is a major distinction from the use of ACE as a metric in classification tasks. The transition from classification to segmentation is a key aspect that we are the first to identify as enabling ACE to be used as a loss rather than a metric: classification training pipelines provide too few samples within a mini-batch to meaningfully compute ACE, in contrast to segmentation, where each voxel provides a sample. We will make this clearer in the final version.

DIFFERENTIABILITY (R4) Equations 1, 3, and 5 are differentiable almost everywhere, as the set of bin edges is of measure zero. In practice, we observe that discontinuities at bin junctions do not significantly affect gradient updates: the gradients are based on averaging within bins rather than across bins and benefit from the large number of voxels per bin. Our approach thus ensures sufficient differentiability for practical purposes. Most neural networks, such as those using ReLU, contain gradient discontinuities, and training convergence for such non-smooth networks has been demonstrated under mild conditions [Allen-Zhu ICML 2019]. While we empirically observe convergence, we agree that demonstrating that our loss function meets these mathematical conditions would be insightful future work. We further highlight that the calibration error calculated by our proposed loss is not significantly affected by the number of bins used, as evidenced in supplementary Fig. A1.
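As a toy illustration of this argument (a minimal sketch of hard-binned averaging in PyTorch with made-up values, not the exact equations from the paper), gradients do propagate through a within-bin mean even though the bin assignment itself is a hard, non-differentiable index:

import torch

p = torch.tensor([0.12, 0.18, 0.47, 0.53, 0.91], requires_grad=True)  # predicted probabilities
y = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0])                           # binary ground truth

edges = torch.linspace(0.0, 1.0, 5 + 1)             # 5 equal-width bins
bin_idx = torch.bucketize(p.detach(), edges[1:-1])  # hard bin assignment (acts as a constant mask)

mask = bin_idx == 0                                 # bin [0.0, 0.2): first two entries
gap = (p[mask].mean() - y[mask].mean()).abs()       # |confidence - accuracy| for that bin
gap.backward()

print(p.grad)  # +-1/|bin| for the two entries in the bin, zero elsewhere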

SOTA COMPARISON (R3, R4) Direct application of DECE (R3) and other prior train-time methods was not performed due to fundamental differences between classification and semantic segmentation rather than difficulty of reproduction (cf. NOVELTY paragraph). Previous methods cannot directly estimate calibration from a mini-batch and therefore resort to surrogate approaches such as DECE, which relies on meta-learning. This makes comparisons harder to analyse, as it implies many more changes than just the loss function.

Label smoothing (R4) applies a pre-determined softening of the ground truth, resulting in an overall reduction in confidence and a knock-on but uncontrolled effect on calibration quality. In contrast, our method allows us to monitor and train for calibration explicitly, providing a tailored approach to improving model reliability.
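For reference, label smoothing in its standard formulation (independent of any implementation in the paper; values below are illustrative) replaces the one-hot targets with fixed soft targets before the loss is computed, which is what makes its effect on calibration indirect:

import torch

eps, num_classes = 0.1, 4
onehot = torch.nn.functional.one_hot(torch.tensor([0, 2, 3]), num_classes).float()
smoothed = (1.0 - eps) * onehot + eps / num_classes   # e.g. 1.0 -> 0.925, 0.0 -> 0.025
# The network is then trained against `smoothed` instead of `onehot`,
# lowering confidence uniformly rather than targeting calibration directly.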

Despite this, we recognize the value of additional comparisons and plan to include them in a future journal extension.

STATISTICAL TESTING (R4) We recognize that t-tests assume data independence and have limitations in this context. However, t-tests remain standard practice in the community and provide useful insight. We appreciate the suggestion and will consider Bayesian methods in future work to enhance statistical rigour.

RESULTS REPORTING (R3, R5) Figure 3 focuses on the DSC loss because adding plots from other losses led to significant plot overlap, making them difficult to interpret (R3). The DSC loss results were visually informative and central to our paper and were thus highlighted. We nonetheless present all results transparently in Tables 1 and 2. We will revise Fig. 3 for clarity, ensuring legible tick labels and thicker markers.

The regions shown in Fig. 4 differ in size as they represent predictions from different models (R5). Each model has unique predictions, resulting in varying false positive and false negative regions.

MANUSCRIPT ORGANISATION (R5) Conflicting opinions on organisation and clarity were provided, with R3 and R4 giving very positive feedback. However, we appreciate R5’s comments and will revise the manuscript to remove unnecessary commas and improve readability. We will also define the sub-equations to enhance clarity.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers now recommend acceptance. The only initially skeptical reviewer was happy with the compromise that the authors made to improve writing, so please try to incorporate their feedback as much as possible to the final version. Congratulations!

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers have agreed to accept this paper based on its novelty in image segmentation calibration.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



