Abstract

Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determining which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03, which is three times larger than the median performance gap between the first- and second-ranked methods. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.
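A minimal sketch of the idea described above (illustrative only; see the code repository linked below for the authors' actual implementation): a second-order polynomial mapping mean DSC to SD is fitted on data where both are reported, and the fitted curve is then used to reconstruct a 95% CI from a published mean DSC and test-set size. The calibration values, function names, and the normal-approximation CI formula below are placeholder assumptions, not taken from the paper.

```python
# Illustrative sketch only, not the authors' released code. Calibration pairs,
# names, and the normal-approximation CI are placeholder assumptions.
import numpy as np

# Hypothetical calibration data: (mean DSC, SD) pairs, e.g. from challenge results.
mean_dsc = np.array([0.62, 0.71, 0.78, 0.84, 0.88, 0.91, 0.94])
sd_dsc   = np.array([0.21, 0.18, 0.14, 0.11, 0.09, 0.07, 0.05])

# Second-order polynomial fit: SD ~ a * DSC^2 + b * DSC + c
coeffs = np.polyfit(mean_dsc, sd_dsc, deg=2)

def reconstruct_ci(mean_dsc_reported: float, n_test_cases: int, z: float = 1.96):
    """Approximate 95% CI for a paper that reports only the mean DSC and the
    number of test cases (normal approximation of the CI of the mean)."""
    sd_hat = np.polyval(coeffs, mean_dsc_reported)
    half_width = z * sd_hat / np.sqrt(n_test_cases)
    return mean_dsc_reported - half_width, mean_dsc_reported + half_width

low, high = reconstruct_ci(0.89, n_test_cases=50)
print(f"Reconstructed 95% CI: [{low:.3f}, {high:.3f}]")
```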

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3400_paper.pdf

SharedIt Link: https://rdcu.be/dV53Q

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_12

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3400_supp.pdf

Link to the Code Repository

https://github.com/IMSY-DKFZ/CI_uncovered

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Chr_Confidence_MICCAI2024,
        author = { Christodoulou, Evangelia and Reinke, Annika and Houhou, Rola and Kalinowski, Piotr and Erkan, Selen and Sudre, Carole H. and Burgos, Ninon and Boutaj, Sofiène and Loizillon, Sophie and Solal, Maëlys and Rieke, Nicola and Cheplygina, Veronika and Antonelli, Michela and Mayer, Leon D. and Tizabi, Minu D. and Cardoso, M. Jorge and Simpson, Amber and Jäger, Paul F. and Kopp-Schneider, Annette and Varoquaux, Gaël and Colliot, Olivier and Maier-Hein, Lena},
        title = { { Confidence intervals uncovered: Are we ready for real-world medical imaging AI? } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {124 -- 132}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper analyses the MICCAI segmentation papers from 2023, concluding that 50% of the papers do not assess performance variability in any manner, with only 0.5% reporting confidence intervals. The authors suggest addressing this issue by predicting the missing standard deviation from the reported mean Dice via a second-order polynomial function. Finally, by reconstructing 95% confidence intervals around the mean Dice for the MICCAI papers, they conclude that, for the majority of papers, the performance gaps are not as large and clear as expected based solely on the mean value, reiterating the importance of reporting variability measures.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • the addressed issue is quite relevant, since reporting the variability of methods is essential for assessing whether automatic methods can be translated into clinical practice
    • the analysis performed in the paper is sound
    • the paper is well written
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed approach for overcoming the lack of variability reporting is not explored in much depth, and I do not feel it has been sufficiently validated; no objective metrics are computed, such as the coefficient of determination between the predicted and true variability metrics (see the sketch after this list for what such a check could look like)
    • in the Discussion, the authors raise the issue that a large enough sample size can yield significant differences even when this should not be the case, but do not mention mitigation strategies such as reporting an effect size (e.g., Cohen's d) alongside the p-value
    • the authors seem to address only the variability coming from the test set, but the variability of the learning procedure (such as that coming from different network initialisations), which is mentioned in the Discussion, is also very (if not more) relevant
    • the organisation is overall OK, but the related work is misplaced, since it comes in the Discussion instead of the Introduction
    • no conclusions section
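
    As a concrete illustration of the kind of check I have in mind (this is not something reported in the paper), the coefficient of determination between predicted and reported SDs could be computed along these lines; all values and names below are hypothetical placeholders:

```python
# Hypothetical illustration: quantify how well the estimated SDs match the
# reported ones with a coefficient of determination (R^2), instead of relying
# on a visual comparison alone. The arrays below are placeholder values.
import numpy as np
from sklearn.metrics import r2_score

sd_reported  = np.array([0.12, 0.09, 0.15, 0.07, 0.11])  # SDs actually reported in papers
sd_predicted = np.array([0.11, 0.10, 0.13, 0.08, 0.12])  # SDs estimated from the mean DSC

print(f"R^2 between predicted and reported SD: {r2_score(sd_reported, sd_predicted):.3f}")
```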
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • in the Discussion, the authors refer to the issue that a large enough sample size can lead to significant differences even when this should not be the case. However, there are approaches to mitigate this issue, such as reporting the effect size (e.g., Cohen's d) alongside the p-value, which are not mentioned (a minimal example is sketched after this list).
    • the Related Work should come at the end of the Introduction instead of in the Discussion, as it does now
    • in Section 2.2, it is not completely clear to me how the SD was computed; is this the SD for each individual segmentation task? Please clarify
    • the authors should have computed objective, quantitative metrics on the performance of the proposed method: how do the estimated SD/CI values correlate with the true ones? Instead of just showing the plot in Fig. 4a, it would have been nice to report, for instance, an R² value
    • the authors should add a Conclusions section
    • remove the sentence repeated at the end of paragraph 1 on page 4 (the first paragraph of the Methods section)
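
    To make the effect-size point above concrete: with synthetic data (not taken from the paper), a paired t-test can return a highly significant p-value for a practically negligible DSC improvement, while Cohen's d exposes how small the effect is. A minimal sketch, assuming per-case DSC values are available for both methods:

```python
# Synthetic illustration: large test sets make tiny differences "significant",
# so an effect size (Cohen's d) should be reported alongside the p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000                                    # large test set
dsc_a = rng.normal(0.85, 0.04, n)           # per-case DSC of method A
dsc_b = dsc_a + rng.normal(0.003, 0.02, n)  # method B: ~0.003 mean improvement

diff = dsc_b - dsc_a
t_stat, p_value = stats.ttest_rel(dsc_b, dsc_a)
cohens_d = diff.mean() / diff.std(ddof=1)   # Cohen's d for paired samples

print(f"p-value = {p_value:.2e}, Cohen's d = {cohens_d:.2f}")  # tiny p, small effect
```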
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a relevant topic in the development of automatic methods for medical image analysis and assisted diagnosis, and it is well written. However, in my opinion, the proposed approach for overcoming the lack of variability reporting is insufficiently justified and validated from a technical standpoint. I am not completely confident that the contributions of the paper justify acceptance for publication at MICCAI.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors describe the lack of reporting of algorithmic variability, which is essential for describing the robustness of a segmentation model and for assessing claims of superiority over previously proposed approaches. They analysed the situation for MICCAI 2023 segmentation papers and also propose a prediction model of SD based on DSC to estimate variability for papers where variability was not reported.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I am very glad the authors performed this study, since robustness is ultimately essential to see AI systems deployed, adopted and trusted by clinicians. The paper is well written and deserves a place in the conference to improve best practices we see in the community.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some area of discussion/potential-improvements:

    • I think a missing link here is the inclusion of inter-rater variability in comparing methods. Ultimately, we want these systems to perform the tedious task of image segmentation. Depending on the task, the inter-rater variability varies; hence, talking about superiority should, in my opinion, also be put in the context of human performance or the system’s performance specifications (see the sketch at the end of this answer). This may go beyond the paper, but ultimately metrics should guide us in defining sufficiency. As mentioned by someone, one can spend an entire academic career trying to get DSC = 1.0. Providing a bigger picture (especially to young researchers) is vital to avoid falling into that trap, and I think this message could be added to this paper.
    • In relation to the previous comment, I would also suggest that the authors note that no single metric is perfect; hence, reporting variability across different metrics is important. I mention this since the paper focuses on the DSC, which can give the wrong impression to a newcomer to the field.
    • One potential (minor) limitation is that the paper focuses on MICCAI papers, which are short by nature (oh, those margins!). What is the current situation across our main journals? As a reviewer, when SD values are not present, I consistently ask for them in tables, because of the exact goal of this paper.

    Minor: see the repeated “domain” in Section 2.
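
    To illustrate the inter-rater point above (this is not part of the paper; masks and numbers are toy placeholders): pairwise DSC between annotators on the same cases can provide a reference band against which a model's DSC is judged.

```python
# Toy illustration: inter-rater DSC as a reference band for model performance.
import numpy as np
from itertools import combinations

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

rng = np.random.default_rng(42)
reference = rng.random((64, 64)) > 0.6                      # toy "true" structure
annotators = [                                              # three noisy human annotations
    np.logical_xor(reference, rng.random((64, 64)) > 0.97)
    for _ in range(3)
]

inter_rater = [dice(a, b) for a, b in combinations(annotators, 2)]
print(f"Mean inter-rater DSC: {np.mean(inter_rater):.3f}")  # context for a model's DSC
```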

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I do not have further comments beyond those already provided above. The paper presents a clean and important message. I hope the authors can analyze the situation for journal papers, which typically have more space and where reviewers might have more direct leverage to ask for that information.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clean and clear message. The paper might focus on MICCAI segmentation papers, but it probably also reflects a more general tendency in reporting results.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper has analyzed the MICCAI 2023 segmentation papers, looked into the performance variability measures (such as confidence intervals and standard deviations) reported by papers on medical image segmentation, and constructed 95% CIs around the mean Dice scores.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In the age of segmentation models that are trained on billions of natural images and then applied to medical image segmentation tasks, without the models being explainable, the importance of reporting performance variability is not emphasized enough. I believe this paper did a fantastic job of analyzing how many segmentation papers rely solely on reporting the mean Dice score and how this can be misleading, as it does not explain the generalizability of these models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Only the 2023 MICCAI papers are analyzed.
    2. No information is given on the categories and datasets of the images that are segmented. Reporting of performance variability often varies across domains, and there are reasons why some works avoid reporting it.
    3. No breakdown by model type is provided.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The paper should look beyond only the 2023 MICCAI papers.
    2. There should be a division by the categories and datasets of the images that are segmented. Reporting of performance variability often varies across domains, and there are reasons why some works avoid reporting it.
    3. There should be a division by model type; this would provide additional information that could be insightful for some state-of-the-art models.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The importance of the work and how it explains most of the segmentation work that is translated from the natural image domains into the medical field.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank the reviewers for their support of our work and the valuable comments.

Response to criticism:

Restriction to DSC (R3): We fully agree that the reporting of a single metric is rarely sufficient (see also the Metrics Reloaded framework). In this study, we focused on the DSC because it is the most widely used metric in biomedical image analysis (https://www.nature.com/articles/s41467-018-07619-7) and was used by ~80% of the segmentation papers analyzed in this study. Other metrics would have resulted in a much smaller sample size (e.g., the Hausdorff distance, the most frequently used contour-based metric, was reported by only 11%).

Broadening the scope beyond MICCAI (R3/5): Paper screening is highly time-consuming, requiring hours per paper (multiple screeners plus conflict resolution). We therefore aimed for a representative sample of papers and decided on the MICCAI proceedings, which are representative of high-quality work in the field of biomedical image segmentation and are also an exact match to the MICCAI audience to which the paper will be presented.

Stratification by entity/model (R3): This is an excellent idea. As new experimental results are not allowed in the rebuttal, we will consider this remark for a journal extension.

Calibration of SD approximation method (R4): We can complement the figure with more textual quantitative information.

Discussion of aspects related to inter-rater variability (R3): Excellent recommendation, which we will incorporate in the final version if space allows.

All: We will further clarify missing details in the final version.




Meta-Review

Meta-review not available, early accepted paper.


