Abstract

Deep Neural Networks (DNNs) exhibit exceptional performance on various tasks; however, their susceptibility to miscalibration poses challenges in healthcare applications, impacting reliability and trustworthiness. Label smoothing, which replaces hard targets with soft targets based on a uniform distribution over labels, is a widely used strategy to improve model calibration. We propose an improved strategy, Label Smoothing Plus (LS+), which uses a class-specific prior estimated from the validation set to account for the model's current calibration level. We evaluate the effectiveness of our approach by comparing it with state-of-the-art methods on three benchmark medical imaging datasets, using two different architectures and several performance and calibration metrics for the classification task. Experimental results show a notable reduction in calibration error metrics with a nominal improvement in performance compared to other approaches, suggesting that our proposed method provides more reliable prediction probabilities. Code is available at https://github.com/abhisheksambyal/lsplus.
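For concreteness, here is a hedged sketch (in LaTeX) of the two target distributions, consistent with the reviewers' summaries below and the alpha = 0.5 mentioned in the rebuttal. How LS+ spreads the residual mass over the non-target classes is an assumption here; the exact prior v^k is defined in the paper.

    % y: one-hot label, K: number of classes, \alpha: smoothing weight,
    % a_k: validation-set accuracy of the true class k.
    \tilde{y}_{\mathrm{LS}}  = (1 - \alpha)\, y + \alpha \cdot \tfrac{1}{K}\,\mathbf{1}
    \tilde{y}_{\mathrm{LS+}} = (1 - \alpha)\, y + \alpha\, v^{k}, \qquad
    v^{k}_j =
    \begin{cases}
      a_k & j = k \\
      \frac{1 - a_k}{K - 1} & j \neq k
    \end{cases}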

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3276_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3276_supp.pdf

Link to the Code Repository

https://github.com/abhisheksambyal/lsplus

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Sam_LS_MICCAI2024,
        author = { Sambyal, Abhishek Singh and Niyaz, Usma and Shrivastava, Saksham and Krishnan, Narayanan C. and Bathula, Deepti R.},
        title = { { LS+: Informed Label Smoothing for Improving Calibration in Medical Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The label-smoothing training procedure is known to yield a confidence-calibrated model. The paper proposes a modification of label smoothing that replaces the uniform distribution with a distribution in which the probability of the correct label k is the network's average accuracy for class k, computed on a validation set.
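    A minimal PyTorch-style sketch of such class-informed soft targets, assuming per-class validation accuracies are already available; the way the residual mass is spread over the non-target classes and the mixing weight alpha are assumptions, not the paper's exact formulation.

        import torch
        import torch.nn.functional as F

        def lsplus_targets(labels, class_val_acc, alpha=0.5):
            """Mix one-hot labels with a class-specific prior (hypothetical helper).

            labels:        (N,) integer class indices
            class_val_acc: (K,) per-class accuracy measured on the validation set
            alpha:         smoothing weight (0.5 is the value reported in the rebuttal)
            """
            num_classes = class_val_acc.shape[0]
            one_hot = F.one_hot(labels, num_classes).float()

            # Prior: validation accuracy of the true class on that class, the
            # remaining mass spread uniformly over the other classes (assumed).
            acc = class_val_acc[labels].unsqueeze(1)                  # (N, 1)
            prior = ((1.0 - acc) / (num_classes - 1)).expand(-1, num_classes).clone()
            prior.scatter_(1, labels.unsqueeze(1), acc)

            return (1.0 - alpha) * one_hot + alpha * prior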

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is very simple, and the reported experiments indicate improved results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Calibration can be done either during training or as a post-processing step performed on a validation set. The proposed method is effectively a post-processing step: a validation set is used to find the class accuracies, and then the model is retrained on the training set.

    If you assume a validation set is available for calibration, why not apply temperature scaling (or at least compare with it)?

    There are many relevant training details that are not reported. What loss function is used in the first training step? What value of alpha is used for LS and LS+?

    How does your approach work with unbalanced classes, where rare classes are much less often correctly predicted?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    see weaknesses

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposed method is based on two steps: 1) standard training, and 2) training with label smoothing that is biased towards the correct class. The following two ablation studies should be performed before assessing the advantage of the proposed method: 1) use the standard LS loss instead of LS+ in the second step; 2) in the smoothing distribution v^k, replace the validation set class accuracy with a constant value (e.g. 0.6) that gives a constant bias towards the correct class. You can do this both as a single training step and as an alternative second step.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are missing implementation details that may be tailored towards improved results (see weaknesses).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper describes an extension to Label Smoothing (Label Smoothing Plus) that eliminates the need to specify a smoothing value in favour of estimating it from the validation set. The smoothing value is estimated for each batch to determine how much adjustment is needed during training to improve calibration. The paper then presents experiments on three medical imaging datasets (Chaoyang, MHIST, and ISIC-2018) and two neural network architectures (ResNet-34 and ResNet-50), comparing the extension to other “train-time” calibration methods using ECE, ACE, SCE, and CE to measure calibration performance, as well as retention curves and confidence density plots.
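    For reference, a minimal NumPy sketch of the standard binned ECE underlying these comparisons (function and variable names are illustrative; the KDE-based variant raised later in this review replaces the histogram binning with a kernel density estimate):

        import numpy as np

        def expected_calibration_error(confidences, predictions, labels, n_bins=15):
            """Binned ECE: bin-weighted average |accuracy - confidence| gap."""
            confidences = np.asarray(confidences, dtype=float)
            correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)

            edges = np.linspace(0.0, 1.0, n_bins + 1)
            ece = 0.0
            for lo, hi in zip(edges[:-1], edges[1:]):
                in_bin = (confidences > lo) & (confidences <= hi)
                if in_bin.any():
                    gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                    ece += in_bin.mean() * gap
            return ece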

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The extension is novel in that it addresses the limitations of the original method while also improving calibration performance by encouraging the model to produce better calibrated predictions based on its current performance on the validation set.

    The paper clearly explains the motivation behind the extension, as well as the limitations of label smoothing and how this extension addresses them. The contribution is also presented in algorithmic format.

    The experiments in this paper appear to be robust, with multiple datasets, more than one model architecture, and multiple measures of model performance and calibration reported, as well as standard deviations across multiple runs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There has been a lot of previous work done to create variations of label smoothing, but this paper makes no mention of any other parallel developments.

    This method appears to require multiple evaluations of the model on the validation set during training, but the authors have not commented on the additional computational requirements.

    This paper used ECE throughout; however, the MICCAI community has mostly moved to the more recent ECE KDE, which is based on kernel density estimators instead of binning (see “Metrics Reloaded: Pitfalls and Recommendations for Image Analysis Validation”).

    Most train-time calibration methods necessitate additional computation during training or the adjustment of more hyperparameters prior to training, which is why most research has favoured post-hoc calibration methods. The experiments presented in the paper only compare other train-time calibration methods; I believe it would be preferable to include other popular post-hoc methods in the experiments.

    There is no mention of dataset imbalance or how this method might behave in such a setting outside of the results discussion. I believe that having to respond to the validation set to adjust the calibration method may increase the method’s sensitivity to data imbalance.

    Only ResNet-based models were used; it would have been interesting to include more visual transformer-based models, which are becoming increasingly popular and exhibit very different calibration behaviours.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Some of the methods used for comparison in the experiments are missing values needed to reproduce the results. I expect these will be included in the source code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I think the experiments need to be expanded. In a world with many calibration metrics, it is important to compare against the methods that other researchers are utilizing and to use metrics that have become standard for assessing calibration.

    I also think it is important to assess the computational cost of this method; as it is not mentioned in the paper, I imagine the cost is very high, which would also explain the smaller-than-standard validation set. A discussion of this, I think, is important when trying to convince others to use this method.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a minor extension to an already existing method that offers little novelty. The experiments are robust, but there are no comparisons to any post-hoc calibration methods. The computational cost of this method is not discussed, despite the fact that it appears to be computationally expensive.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The paper is interesting, and the authors addressed most of the issues presented.



Review #3

  • Please describe the contribution of the paper

    This paper proposes Label Smoothing Plus, a trainable calibration method for medical image classification. Label Smoothing smooths the label vector with a uniformly distributed vector. In contrast, Label Smoothing Plus smooths the label vector with an informed class-specific prior. This prior is calculated on the validation set.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It is effective and innovative to utilize the information in the validation set to improve the calibration performance of Label Smoothing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Since the main contribution of this paper is how to utilize the information in the validation set to improve a trainable calibration method, it is reasonable to compare the proposed method with other calibration methods that require a validation set, such as temperature scaling.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Comparing with methods like temperature scaling would make the paper more convincing.
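    For completeness, a minimal sketch of the standard post-hoc temperature-scaling baseline being requested: a single temperature T is fit on held-out validation logits by minimizing the NLL, and calibrated probabilities are softmax(logits / T). Function and variable names are illustrative.

        import torch

        def fit_temperature(val_logits, val_labels, max_iter=50):
            """Fit one scalar temperature on validation logits by minimizing NLL."""
            log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
            optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
            nll = torch.nn.CrossEntropyLoss()

            def closure():
                optimizer.zero_grad()
                loss = nll(val_logits / log_t.exp(), val_labels)
                loss.backward()
                return loss

            optimizer.step(closure)
            return log_t.exp().item()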

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of utilizing the validation set to improve a trainable calibration method is interesting.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We are grateful to the reviewers for their time and effort. Their constructive comments have been very helpful in revising our manuscript and improving its quality. Addressing the reviewers' concerns (below) further emphasized the simplicity yet effectiveness of our proposed method. (Code and supplementary results requested by the reviewers will be available on GitHub post-acceptance.)

Q1 Comparison with temperature scaling. R1, R2, R3: We used temperature scaling on the Chaoyang & MHIST datasets with ResNet34/50, and our method showed the best calibration performance. Experimental details will be shared post-acceptance.

Q2 High computational cost. R2: Training LS+ has the same cost as standard LS. Our method does not require multiple validations. V^{acc} are the predictions from a hard label (HL) model, computed only once before LS+ training, with negligible cost at inference.

Q3 Sensitivity to class imbalance. R1, R2: In case of data imbalance, while the accuracy of rare classes tends to be low, the predicted probabilities are generally high. LS+ reduces the predicted probabilities by aligning them with validation set accuracies using KL divergence to achieve better calibration. We have included experiments on an imbalanced dataset (ISIC; Classes=7) (Page 6) to show that LS+ improves model calibration. The retention curves in Fig. 1 and the results in Table 1-D3 indicate that, despite data imbalance, the model exhibits more reliable behavior and is less sensitive to data imbalance.

Q4 Missing implementation details. R1, R2: For LS+, we used alpha=0.5; the cross-entropy loss function is used in the first training step; other implementation details are referenced from [7, 18]. We will provide the complete code, including the ablation studies requested by the reviewers, after acceptance.

Q5 Comparison with standard metrics for calibration, including ECE KDE. R2: In our experiments, we used standard metrics to assess calibration (ECE, SCE, ACE, Brier, NLL), as used in MICCAI 2023 papers [1, 2, 3]. Following the reviewer's suggestion, within the given time limit we also compared the ECE KDE metric on the MHIST dataset, and our method demonstrated the best calibration.

  1. (Brier/ECE) Maximum Entropy on Erroneous Predictions.
  2. (ECE/NLL) Multi-Head Multi-Loss Model Calibration.
  3. (ECE/SCE) Trust Your Neighbors: Penalty-Based Constraints for Model Calibration.

Q6 Suggested ablation studies: A) Standard training followed by fine-tuning with the standard LS approach. B) Replace the validation set class accuracy with a constant value (e.g. 0.6). R1: A) We would like to clarify that V^{acc} is computed only once from the HL (hard label)-trained model and is used throughout the training of LS+. The HL-trained model itself is not used for fine-tuning with LS+. As suggested by the reviewer, we performed experiments with the HL-trained model fine-tuned on LS and observed degraded calibration results compared to our approach on both the Chaoyang and MHIST datasets using R34/R50 architectures. B) Experimenting with the reviewer's suggestion of replacing V^{acc} with a constant value (0.4/0.6), we observed significant degradation in the model's calibration. Manually finding the correct value is difficult and requires extensive trial and error, whereas LS+ automatically finds the correct value using the validation set.

Q7 Comparison with other variants of LS. R2: Numerous works on LS variations primarily aim to enhance model performance, with only a few focusing on calibration. We discussed approaches focused on improving calibration, beyond LS variants. Our aim is to emphasize that our method's performance and calibration are comparable to or better than SOTA calibration approaches, not just LS methods. Furthermore, as per the reviewer's suggestion, we will expand the scope of our experiments by comparing with other LS variants, such as margin-based label smoothing, in future work.

Q8 Effect on transformer-based models. R2: As future work, we will explore LS+ with transformer-based models.
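To make the two-stage recipe in Q2, Q4, and Q6 concrete, here is a hedged sketch of the procedure as described in the rebuttal: the hard-label model is trained with cross-entropy, V^{acc} is computed once on the validation set, and LS+ training then aligns predictions with class-informed soft targets via KL divergence with alpha = 0.5. The exact form of the prior and the loss is defined in the paper; helper and variable names here are illustrative.

    import torch
    import torch.nn.functional as F

    def per_class_accuracy(model, val_loader, num_classes, device="cpu"):
        """V^{acc}: per-class accuracy of the hard-label (HL) model, computed once."""
        correct = torch.zeros(num_classes)
        total = torch.zeros(num_classes)
        model.eval()
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                for c in range(num_classes):
                    mask = labels == c
                    total[c] += mask.sum()
                    correct[c] += (preds[mask] == c).sum()
        return correct / total.clamp(min=1)

    def train_lsplus_epoch(model, train_loader, optimizer, class_val_acc,
                           num_classes, alpha=0.5, device="cpu"):
        """One LS+ epoch: KL divergence against class-informed soft targets (assumed form)."""
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            # Prior: validation accuracy on the true class, remaining mass spread
            # over the other classes (assumption; the paper defines the exact v^k).
            acc = class_val_acc.to(device)[labels].unsqueeze(1)
            prior = ((1 - acc) / (num_classes - 1)).expand(-1, num_classes).clone()
            prior.scatter_(1, labels.unsqueeze(1), acc)
            targets = (1 - alpha) * F.one_hot(labels, num_classes).float() + alpha * prior
            log_probs = F.log_softmax(model(images), dim=1)
            loss = F.kl_div(log_probs, targets, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()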




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two reviewers gave this paper a score of weak reject. One of them raised it to weak accept; the other reviewer did not come back after the rebuttal for discussion. After reading the authors' feedback, I found it convincing. Even if the paper is not a major contribution to the state of the art, it appears good enough for acceptance into MICCAI's main track.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The concept is straightforward yet effective, and the rebuttal addressed the reviewers’ concerns.



