Abstract

Effective confidence estimation is desired for image classification tasks like clinical diagnosis based on medical imaging. However, it is well known that modern neural networks often show over-confidence in their predictions. Deep Ensemble (DE) is one of the state-of-the-art methods for estimating reliable confidence. In this work, we observed that DE sometimes harms confidence estimation because it outputs relatively lower confidence for correctly classified samples. Motivated by the observation that a doctor often refers to other doctors’ opinions to adjust the confidence in his or her own decision, we propose a simple but effective post-hoc confidence estimation method called Deep Model Reference (DMR). Specifically, DMR employs one individual model to make the decision while a group of individual models helps estimate the confidence for that decision. Rigorous proof and extensive empirical evaluations show that DMR achieves superior performance in confidence estimation compared to DE and other state-of-the-art methods, making trustworthy image classification more practical. Source code is available at https://openi.pcl.ac.cn/OpenMedIA/MICCAI2024_DMR.
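To make the aggregation concrete, here is a minimal sketch of the DMR idea as described in this abstract and, in more detail, in Review #3 below: one ensemble member makes the prediction, and the confidence is the average probability that all members assign to that member's predicted class. The NumPy implementation, function names, and toy numbers are illustrative assumptions, not the authors' released code (see the repository linked above for that).

```python
# Minimal sketch of the DMR idea as described in the abstract and Review #3
# (illustrative only; names and details are assumptions, not the authors' code).
import numpy as np

def dmr_confidence(probs, decision_idx=0):
    """probs: (M, C) softmax outputs of M ensemble members for one sample.
    decision_idx: index of the member that makes the decision.

    Returns that member's predicted class and its DMR confidence:
    the mean probability all members assign to that predicted class."""
    pred_class = int(np.argmax(probs[decision_idx]))   # decision by one model
    confidence = float(np.mean(probs[:, pred_class]))  # reference to the group
    return pred_class, confidence

def de_confidence(probs):
    """Standard Deep Ensemble baseline: average the member probabilities,
    then predict the class with the highest averaged probability."""
    avg = probs.mean(axis=0)
    pred_class = int(np.argmax(avg))
    return pred_class, float(avg[pred_class])

# Toy example with 3 members and 4 classes.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.30, 0.40, 0.20, 0.10],
                  [0.60, 0.20, 0.10, 0.10]])
print(dmr_confidence(probs))
print(de_confidence(probs))
```

As noted in Review #3, when the chosen member agrees with the ensemble prediction this confidence coincides with the usual DE confidence; when it disagrees, the averaged probability for the chosen class drops, reducing overconfidence on ambiguous samples.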

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2136_paper.pdf

SharedIt Link: https://rdcu.be/dV53V

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_17

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2136_supp.pdf

Link to the Code Repository

https://openi.pcl.ac.cn/OpenMedIA/MICCAI2024_DMR

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zhe_Deep_MICCAI2024,
        author = { Zheng, Yuanhang and Qiu, Yiqiao and Che, Haoxuan and Chen, Hao and Zheng, Wei-Shi and Wang, Ruixuan},
        title = { { Deep Model Reference: Simple yet Effective Confidence Estimation for Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {175 -- 185}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper addresses the issue of accurately estimating the confidence of a prediction provided by a classification network. It proposes to train an auxiliary set of classifiers and to set the confidence to the average probability of the predicted class across all the networks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses an important problem of accurate confidence estimation.

    A comparison to previous methods shows improved results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It seems a bit strange to me to train a whole set of networks solely for the purpose of confidence estimation. Using the networks for prediction (as in DE), one can gain much better accuracy.

    The standard way to measure confidence accuracy (a.k.a. calibration) is the Expected Calibration Error (ECE). Why not use ECE to evaluate the proposed method?

    The standard way to make the confidence more accurate is Temperature Scaling (TS). TS is a post-hoc calibration procedure that does not change the accuracy. It makes sense to compare the proposed method to TS.

    Neural networks tend to be overconfident. It seems to me that the main reason DMR is better calibrated than DE is that its accuracy (which is based on a single network) is worse than the accuracy obtained with an ensemble of networks.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Confidence calibration is a well-studied research topic that is closely related to the paper’s topic. It should be mentioned in the related work section, and you should compare your method to standard confidence calibration methods.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Lack of comparison. It seems strange to train a whole set of networks solely for the purpose of confidence estimation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors provided satisfactory answers to my questions, mainly regarding the positioning of the paper.



Review #2

  • Please describe the contribution of the paper

    The authors propose a new method called Deep Model Reference (DMR) for misclassification detection. A theoretical proof is provided to justify why DMR can match or outperform Deep Ensemble (DE) in misclassification detection. Experiments compare DMR’s performance to that of other methods across different neural network architectures and datasets, including under distribution shift.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • theoretical proof is provided
    • supplementary materials with the code and data
    • better results than DE methods for misclassification detection
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • requires extra computational cost
    • lower accuracy
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The submission has provided supplementary materials with the code and data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors propose a new method called Deep Model Reference (DMR) for misclassification detection. A theoretical proof is provided to justify why DMR can match or outperform Deep Ensemble (DE) in misclassification detection. Experiments compare DMR’s performance to that of other methods across different neural network architectures and datasets, including under distribution shift. The method was evaluated on 3 natural image datasets and 2 medical image datasets with different neural network architectures. It obtained better results than DE for misclassification detection, but the authors say that it “slightly sacrifices the accuracy”, and no comparison results on accuracy are provided. It would be interesting to add them.

    Could you add information on computational cost for the different methods?

    Experiments under distribution shift are carried out on natural images. For a publication at the MICCAI conference, it would be more interesting to carry out these experiments on medical images.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method proposed for misclassification detection is interesting, but some additional information is required (accuracy, computational cost). Moreover, it would be more interesting for the MICCAI audience to focus more on medical images than on natural images (in particular for the experiments under distribution shift).

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors’ answers are convincing.



Review #3

  • Please describe the contribution of the paper

    This study introduces a novel aggregation function designed to enhance the calibration of confidence scores in ensemble predictions, ensuring they more accurately indicate the likelihood of classification errors. Unlike conventional methods that aggregate individual confidence scores using an arithmetic mean and select the class with the highest score as the prediction, this method selects a random ensemble member to make the prediction, and the final confidence score is calculated as the arithmetic mean of the scores from all ensemble members for the class predicted by the selected member. When the prediction of the selected member matches the ensemble prediction (using the traditional aggregation approach), the confidence score remains unchanged. However, if the predictions differ, the confidence score decreases, effectively reducing overconfidence in cases of ambiguity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Simplicity: The revision to the aggregation function is a straightforward post-hoc adjustment that does not complicate the modeling or inference processes, yet it offers an advantageous alternative as demonstrated by the authors.

    • Evidence of Enhanced Predictive Accuracy: The authors have conducted both theoretical and empirical analyses that show improvements in predictive accuracy with the new aggregation function across multiple benchmarks.

    • Universality: This method is versatile and not confined to specific tasks, making it applicable across a variety of medical classification scenarios.

    • Topic Significance: Calibrating confidence is key to model reliability, enhancing risk-cost analysis in medical decision-making and boosting transparency and trust in ML predictions for both medical personnel and patients.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Inconsistent Performance Assumption: The theoretical foundation presumes that individual members of an ensemble generally exhibit lower accuracy than the ensemble as a whole. While ensembles often reduce model bias, this effect is not guaranteed. Primarily, these methods aim to decrease predictive variance, with any reduction in bias usually being a secondary effect. The averaging process inherent in ensembles may sometimes prevent reaching the peak performance of a highly optimized single model.

    • Limited Scope of Ensemble Evaluation: The empirical evaluation and sensitivity analysis focus on an ensemble with a limited number of members (up to 10), which is justifiable due to the high computational expense of re-training neural networks. Nonetheless, the study does not explore the use of common deep learning ensemble methods involving network dilution techniques, like MC-dropout, which facilitate the creation of larger, correlated ensembles.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, the discussed topic is important, the study’s motivation is clear, and the suggested approach provides a nice addition to the current literature.

    A few comments/suggestions:

    • Figure 1. The labels “more/less/moderate correct …” are somewhat unclear. While the authors aimed to offer a comparative analysis of the methods across subplots, I found this confusing. The distribution plots and the legend are self-explanatory.

    • Figures 2, 3, and 4 are missing error bars or another form of variance estimation.

    • There is a gap in the discussion regarding cases where an ensemble member selected at random either significantly outperforms or underperforms in accuracy compared to the overall ensemble prediction. This problem is somewhat obscured in the experimental evaluation due to the small size of the ensembles.

    • How might highly correlated ensembles, such as those produced by popular ensembling methods like MC-dropout, impact the effectiveness of this method?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Ease of Implementation: The proposed method is easy to implement as it does not require modifications to the trained model and offers an alternative way to aggregate ensemble predictions.

    Theoretical Efficacy: The authors present a theoretical analysis that highlights the effectiveness of the suggested approach in calibrating confidence scores when individual members underperform compared to the ensemble.

    Empirical Evaluation: While the authors conduct evaluations across various use cases and network architectures, they fail to explore highly correlated ensembles formed using common methods like Monte Carlo dropout.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Although some of my concerns remain, the paper demonstrates merit and makes a unique contribution.




Author Feedback

Response to Reviewer #1: Q1: Strange to train a set of networks for confidence estimation. A1: Obtaining reliable confidence estimates is crucial in risk-sensitive applications such as clinical diagnosis. This paper provides theoretical and empirical evidence that a set of neural networks can help provide a more accurate estimate of confidence. While Deep Ensemble (DE) may achieve slightly better classification accuracy than an individual classifier (IC), the confidence estimates from DE are worse than those obtained with our method for the IC. In clinical diagnosis, reliable confidence estimation is often the greater concern when DE and IC have comparable classification accuracy.

Q2: Why not use ECE to evaluate the method? A2: The goal of our study is to obtain a reliable confidence estimate. Calibration and misclassification detection are two frameworks used to assess the reliability of confidence estimates. As pointed out in Ref. [31], calibration focuses on aligning accuracy and confidence, while misclassification detection emphasizes the separability between correctly classified and misclassified samples. It is important to note that a model with the best ECE score may not effectively distinguish between correctly classified and misclassified samples [Ref. 31]. We consider misclassification detection a more practical way to measure the reliability of confidence estimation, because human experts, such as doctors, can then pay more attention to misclassified samples (e.g., misdiagnosed patients), thereby avoiding potentially disastrous consequences. Therefore, our primary focus here is on misclassification detection, and ECE is not a proper metric under such a framework.
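For readers unfamiliar with the two frameworks, the following sketch contrasts them on the same inputs: ECE measures how well confidence matches accuracy within bins, whereas misclassification detection measures how well confidence separates correct from incorrect predictions (commonly summarized by AUROC). This is illustrative evaluation code, not the paper's; the binning scheme and function names are assumptions.

```python
# Contrast of calibration (ECE) vs. misclassification detection (AUROC);
# illustrative sketch only, not the evaluation code used in the paper.
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confidences, correct, n_bins=15):
    """confidences: (N,) max predicted probabilities; correct: (N,) 0/1 correctness."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap      # bin weight times |accuracy - confidence|
    return ece

def misclassification_auroc(confidences, correct):
    """AUROC of using confidence to rank correct predictions above misclassified ones."""
    return roc_auc_score(correct, confidences)

# Example: confidences for five predictions and whether each was correct.
conf = np.array([0.95, 0.90, 0.85, 0.60, 0.55])
corr = np.array([1, 1, 0, 1, 0])
print(expected_calibration_error(conf, corr), misclassification_auroc(conf, corr))
```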

Q3: Why not compare the method to Temperature Scaling (TS)? A3: As presented in Ref. [31], Temperature Scaling (TS) does not improve misclassification detection. This is probably because TS divides the logit output by a temperature T, thereby increasing or decreasing the confidence of both correctly classified and misclassified samples by the same scale. TS is an effective method for improving ECE because ECE focuses on aligning accuracy and confidence, and the optimal temperature obtained from a validation set can make the average confidence match the average accuracy within a bin. Therefore, TS is a standard baseline under the calibration framework but not a proper baseline under the misclassification detection framework.
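As a reference for how TS operates (again an illustrative sketch, not a baseline run in the paper): a single temperature T is fitted on validation logits by minimizing the negative log-likelihood, and all test logits are then divided by T before the softmax, which rescales confidences without changing the predicted class or the accuracy.

```python
# Minimal sketch of Temperature Scaling as described above (illustrative only).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax, softmax

def fit_temperature(val_logits, val_labels):
    """val_logits: (N, C) raw logits; val_labels: (N,) integer class labels."""
    def nll(T):
        log_probs = log_softmax(val_logits / T, axis=1)
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()
    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x

def calibrated_confidence(test_logits, T):
    """Rescaled max-softmax confidence; the argmax (prediction) is unchanged."""
    probs = softmax(test_logits / T, axis=1)
    return probs.max(axis=1)
```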

Q4: DMR is more calibrated than DE because its accuracy is worse. A4: A classifier’s accuracy is not correlated with the quality of its estimated confidence. The confidence estimated by MSP (Maximum Softmax Probability) for the individual classifier is inferior to that of DE. In contrast, the confidence estimated by our DMR for the same individual classifier is better than that of DE.

Q5: The submission does not mention open access to source code. A5: We have stated in the Abstract that the source code will be released.

Response to Reviewer #3: Q6: Add accuracy and computational cost. A6: Thank you for the suggestion! We will add them in the new version.

Q7: Perform experiments on medical image datasets under distribution shift. A7: Thank you for the suggestion! We will do so.

Response to Reviewer #4: Q8: Inconsistent Performance Assumption A8: We would like to clarify that our theoretical foundation does not assume that all individual members have lower accuracy; it only assumes that the randomly selected member has accuracy lower than or equal to that of the ensemble. This is often valid in practice. More discussion and relevant experiments will be provided for the case where the randomly selected member has significantly lower or higher accuracy compared to the ensemble model.

Q9: Limited Scope of Ensemble Evaluation A9: We appreciate the suggestion and will perform experiments on correlated ensembles formed by methods like MC-dropout (where we expect positive results).

Other comments will also be carefully considered for further improvement of the paper.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper got three weak recommendations for acceptance, and one was initially a weak rejection, so I will back acceptance.

    I agree with R3 that natural images are not very relevant for this conference, and it makes me feel like the method is useful for medical applications only as an afterthought (using the BUSI and COVID datasets, with not a single visual example in the paper, does not help with this feeling). Anyway, the method is simple, and I found the authors’ response clarifying the difference between calibration and failure prediction quite reasonable, so I don’t find a strong reason to go against the main trend of recommendations here.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a method to estimate the confidence of a classifier. This is an important topic. The proposed approach is quite simple but this is not a drawback in itself. The approach shows good results even though the experiments could have been more extensive and more focused on medical data. The authors need to clarify the positioning in the final version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



