Abstract

The precision of contouring target structures and organs-at-risk (OAR) in radiotherapy planning is crucial for ensuring treatment efficacy and patient safety. Recent advancements in deep learning (DL) have significantly improved OAR contouring performance, yet the reliability of these models, especially in the presence of out-of-distribution (OOD) scenarios, remains a concern in clinical settings. This application study explores the integration of epistemic uncertainty estimation within the OAR contouring workflow to enable OOD detection in clinically relevant scenarios, using specifically compiled data. Furthermore, we introduce an advanced statistical method for OOD detection to enhance the methodological framework of uncertainty estimation. Our empirical evaluation demonstrates that epistemic uncertainty estimation is effective in identifying instances where model predictions are unreliable and may require an expert review. Notably, our approach achieves an AUC-ROC of 0.95 for OOD detection, with a specificity of 0.95 and a sensitivity of 0.92 for implant cases, underscoring its efficacy. This study addresses significant gaps in the current research landscape, such as the lack of ground truth for uncertainty estimation and limited empirical evaluations. Additionally, it provides a clinically relevant application of epistemic uncertainty estimation in an FDA-approved and widely used clinical solution for OAR segmentation from Varian, a Siemens Healthineers company, highlighting its practical benefits.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2441_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Tei_Towards_MICCAI2024,
        author = { Teichmann, Marvin Tom and Datar, Manasi and Kratzke, Lisa and Vega, Fernando and Ghesu, Florin C.},
        title = { { Towards Integrating Epistemic Uncertainty Estimation into the Radiotherapy Workflow } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study applies uncertainty estimation approaches to pelvic CT segmentation and aims to detect novel physical artifacts that impact the quality of the segmentations. In addition, the study applies the Mahalanobis distance (MD) to an uncertainty distribution, whereas previous works apply it to the feature space. Finally, the study uses a chi-squared table to determine the optimal threshold for out-of-distribution (OOD) detection analysis.
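
    As a reading aid (not the authors' code), a minimal sketch of the described idea — a Mahalanobis distance fitted on training-set uncertainty scores, with a chi-square quantile as the OOD threshold — could look as follows; the per-case, per-organ uncertainty summaries and their aggregation are assumptions:

    ```python
    # Illustrative sketch, not the authors' code. Assumes `train_scores` and `case_scores`
    # are (n_cases, n_organs) arrays of per-organ epistemic uncertainty summaries.
    import numpy as np
    from scipy.stats import chi2

    mu = train_scores.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(train_scores, rowvar=False))

    def mahalanobis_sq(scores):
        diff = scores - mu
        return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    # Threshold from the chi-square quantile (df = number of organs), chosen on the
    # training distribution rather than tuned on test labels.
    threshold = chi2.ppf(0.95, df=train_scores.shape[1])
    is_ood = mahalanobis_sq(case_scores) > threshold
    ```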

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The evaluated OOD datasets (physical artifacts) are clinically-relevant and important. Physical artifacts are often not well-represented in training datasets and cause significant failures once segmentation models are clinically-deployed. Displaying high uncertainty near these artifacts protects patients from subpar therapy due to automation bias.

    2) The study uses a chi-squared table to determine the threshold for OOD detection analysis based on the training dataset. In previous research, the MD threshold is most often determined based on the true positive rate on the test dataset. I assume that thresholds determined by the training data are more robust for downstream OOD detection performance than those on the test data.

    3) I appreciated that the authors filtered out uncertainties at organ boundaries. It helps to highlight the larger errors.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    For an application study, the number of uncertainty estimation approaches considered (i.e., (1) a combination of MC Dropout and ensembling, and (2) the proposed MD-based approach) is minimal. Furthermore, the manuscript is lacking important results. For example, there are no quantitative results for the combination of MC Dropout and deep ensembles. The only presented quantitative results are on the proposed MD-based OOD detection method. As the authors claim that their method improves upon feature-based MD methods, it is important to compare to feature-based approaches to substantiate this claim. While I do not desire the above-mentioned experiments in the rebuttal, I would expect them in an expanded work and would be interested to hear the authors’ arguments on why the number of uncertainty estimation approaches and the currently-presented results are sufficient for a MICCAI publication.

    In addition, I don’t agree with the authors’ argument for why their MD approach improves upon feature-based MD methods. They claim that feature-based methods require architectural changes. I find this argument to be misleading because feature-based MD methods don’t change the architecture of the segmentation model. The flattening of the features and subsequent dimensionality reduction is done after the features are extracted post hoc from a trained segmentation model and is fairly seamless. Furthermore, the proposed MD method is presented on an ensemble of learners, whereas feature-based MD methods can be applied to any architecture.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I applaud the authors for being very specific on their implementation of the segmentation model. The implementation of the combination of MC Dropout and deep ensembling is also very clear. Access to code and the in-house dataset is not mentioned.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    While my main concerns are the weaknesses above, I’ve accumulated the rest of my opinions into comments below in an effort to help the authors. I know I added a lot, so please don’t feel like you have to respond to all of them in the limited rebuttal time. What I care about in the rebuttal is the outlined weaknesses above and the following question: “In 3.2, you say that the training images are ID. They aren’t considered ID for the AUROC calculation, right? This calculation should be between the 20 control and 27 OOD images as AUROC is susceptible to class imbalance.”

    Major comments:

    • The subheadings of the paper need to be updated. The first few paragraphs should be under an “Introduction” heading. The data section heading can be shortened to “Data” and put under the “Methods” or “Experiments & Results” section. My personal preference is that the final section heading doesn’t need “Future Work” in the heading.
    • The paper contains multiple buzzwords indicative of LLM model editing. While the paper is clearly written, there are multiple points where the use of buzzwords to make ideas stronger hurts the manuscript. For example, saying this work is “pioneering” in the application of uncertainty estimation to contouring for radiotherapy is absolutely overstating (abstract and contribution). Additionally, the word “critical” feels noticeably overused.
    • In the Intro, you mentioned the “limited empirical evaluations of uncertainty estimation methods… for real clinical applications.” You should reference some of these evaluations and state where they fall short, necessitating your study. Or how your application study differs from theirs and adds to the literature.
    • The second contribution borders on overstatement and redundancy with the first contribution. Where in your study do you advance the evaluation of uncertainty estimation in a radiotherapy workflow? I would simplify it to your introduction of a new OOD detection method. And the third one is redundant and overstating. If using the chi-squared table for threshold selection is novel, I would note this as a contribution instead.
    • The relationship of the OOD data to the original training and test sets is not clear. If the implants were extracted from the training dataset, does that mean that the segmentation models were trained on 679 minus 13 images? Or were the models trained on all 679 training images, including the 13 implant images? If the models were not trained on the entire training dataset, this needs to be made explicit. Furthermore, did the control images contain no OOD aspects at all?
    • The relationship between your clinical solution and the segmentation model trained for this study is unclear - are they the same thing?
    • In 3.2, the citation for the MD should be an earlier work that proposed using the MD for OOD detection. Like Lee et al. (2018) or Gonzalez et al. (2021) would be more appropriate. Or even a reference to the 1930ish definition of MD itself.
    • The MD assumes a Gaussian distribution. Lee et al. (2018) justified the Gaussian assumption for features extracted from classifiers with Gaussian Discriminant Analysis. Is the Gaussian assumption appropriate for uncertainties?
    • In Fig. 4 the maximum uncertainties are presented, but taking the maximum of the uncertainties wasn’t mentioned in the Methods.
    • The images used for the AUROC calculation should be specified.

    Minor comments:

    • The abstract could contain more specificity. For example, it would be helpful to include your application (pelvic CT segmentation). In addition, how OOD detection is defined in the study (i.e. by novelty - evaluated with implants, applicator devices, and a spacer). In addition, naming your OOD detection method and including this in your abstract will help future researchers to refer to your work.
    • For consistency with machine learning literature, I would change “control” dataset to “test” dataset.
    • In Fig 3, it was not immediately clear why the predictions on the left are considered to be confidences. What do you mean by “confidence predictions”?

    Writing notes:

    • AUC-ROC is an abbreviation not defined in the abstract.
    • Deep learning abbreviation defined in abstract, but not in paper.
    • Add a comma after study to the phrase “In this application study we employ uncertainty estimation”.
    • In the Figure 1 caption, “radiotherapy”, “contouring”, “impants”, “rectal”, and “spacer” don’t need to be capitalized. Also, I would change the colon into a period or make (a),(b), and (c) be a part of the same list.
    • In “1. Femur Implants Dataset:”, after “Figure 1(a)”, “femoral” doesn’t need to be capitalized.
    • The spelling of artifact is not consistent throughout (i.e. artifact or artefact).
    • The sentence after the introduction of every dataset about the importance of OOD detection feels redundant and out-of-place.
    • In “2. Brachytherapy…”, “a” can be deleted from “where a potential practitioners are unaware..”
    • What do you mean by “ResBlock counts of 1,1,3,1…”?
    • In 2.2, periods after italicized headings would help readability.
    • In Fig. 4, the capitalization of axes should be standardized. In addition, a boxplot would be more appropriate than a scatterplot for (b). The purpose of 10.64 instead of 10 isn’t readily apparent until the reader later learns that it is the threshold. I personally would add descriptions for the horizontal and vertical lines, take out the 10.64 tick, and avoid truncating the x-axis if possible. The “percent cases covered” label isn’t very clear. Adding in “training” could help.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper would have limited impact on the MICCAI community due to its limited scope. While the evaluated OOD scenarios are unique, it is not surprising that Bayesian approximation methods can locate large physical objects in a hand-picked qualitative evaluation. While the new MD method is interesting and achieves a high AUROC, it is not compared quantitatively with any other OOD detection approaches, including the feature-based MD approaches that it claims to replace.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The main contribution of this paper is that it demonstrates the utility of a combined uncertainty estimation/OOD detection method on an FDA-approved segmentation solution applied to several real-world datasets consisting of physical artifacts. Such OOD datasets are important and are rare in related literature, thereby satisfying MICCAI’s call for application papers, which is why I now recommend “acceptance”. My previous decision was based on considering the Mahalanobis distance applied to uncertainty estimates as a methodological contribution, which would merit more scientific rigor if this were a methodological paper.

    I still believe the lack of comparative evaluations hurts the utility of the paper. The authors cited Zhou et al.’s delineated research gap of the field lacking empirical evaluations as the impetus for their work. But Zhou et al. stated that the reason these empirical evaluations are needed is “to compare and evaluate different methods, and to determine which methods are most effective and efficient in different tasks.” While the included datasets are novel and important, this paper does not compare any methods, thereby limiting the utility of the application.

    As a side note to the authors, I agree with Reviewer #6 that including information on the computational feasibility of your methods would be useful in a camera-ready version. I also apologize that it didn’t connect originally that the MCD/Ensembling estimates were quantitatively evaluated through the MD results. If you have the information readily available, including the results of using the uncertainty estimates directly for OOD detection would be useful as an ablation study on your proposed MD methodology.



Review #2

  • Please describe the contribution of the paper

    This paper presents a method for uncertainty estimation for organ-at-risk segmentation in radiotherapy planning. Segmentation uncertainty is estimated with MC dropout and a deep ensemble model for the purpose of identifying out-of-distribution (OOD) cases potentially requiring further analysis. The method was evaluated using a variant of a U-Net on a private dataset of CT data (N=679 training, N=20 control) with OOD cases due to femoral head implants (N=130), brachytherapy (N=12) and a rectal spacer (N=1) with promising results.
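
    For readers unfamiliar with the setup, a minimal sketch of MC dropout combined with a deep ensemble for voxel-wise epistemic uncertainty is given below; the model list, input shape, and entropy-based summary are assumptions rather than the paper's implementation:

    ```python
    # Illustrative sketch, not the paper's implementation. `models` is a hypothetical
    # list of trained segmentation networks, `volume` a preprocessed CT tensor (1, C, D, H, W).
    import torch

    @torch.no_grad()
    def mc_ensemble_uncertainty(models, volume, n_mc=10):
        probs = []
        for model in models:
            model.eval()
            # MC dropout: keep dropout active at test time, norm layers stay in eval mode.
            for m in model.modules():
                if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d, torch.nn.Dropout3d)):
                    m.train()
            for _ in range(n_mc):
                probs.append(torch.softmax(model(volume), dim=1))
        probs = torch.stack(probs)          # (n_models * n_mc, 1, classes, D, H, W)
        mean_prob = probs.mean(dim=0)
        # Predictive entropy as a simple voxel-wise uncertainty surrogate.
        entropy = -(mean_prob * torch.log(mean_prob + 1e-8)).sum(dim=1)
        return mean_prob, entropy
    ```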

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The estimation of OOD segmentations is valuable and significant.

    The combination of MC dropout, ensembling and Mahalanobis distance estimation with ROC evaluation yields a sensible formulation that addresses a clinically relevant problem.

    The dataset has strong utility for the evaluation.

    The evaluation is convincing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Comparison to alternate techniques would be useful.

    Evaluation on other, ideally public, data would make the evaluation more comprehensive.

    Figure 3 indicates “raw” and “processed” maximum uncertainties. This is not explained.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Evaluation on public data and comparison to some alternate techniques (even if not perfect matches) would be useful.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel approach to OOD detection in the context of OAR segmentation. While the evaluation could be more extensive, this paper is still a useful contribution.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Nice paper with strengths outweighing weaknesses - worthy of acceptance.



Review #3

  • Please describe the contribution of the paper

    This paper focuses on estimating epistemic uncertainties and integrating them into deep learning ensemble models for contouring organs-at-risk within CT scans, specifically to address reliability concerns of these models in out-of-distribution scenarios. The authors clearly identify three key applied datasets (femur implants, brachytherapy, and hydrogel rectal spacer) for which their proposed methods critically address current gaps in the state-of-the-art.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is written and organized very well. The problem is stated clearly, their approach, modeling choices, and discussion of results are all very easy to follow. I also think this seems to address an important question in understanding predictive uncertainties for complex deep learning tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Epistemic uncertainty is only addressed with regards to a single model class type (a deep learner for contour detection), and there is no comparison to other uncertainty estimation approaches for this task.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The deep learning architecture discussion is written fairly well, and their choices are clearly outlined. They do not provide any source code or data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Have the authors assessed the assumption of a shared covariance matrix across classes in modeling the class-conditional uncertainty score distributions? Further, is the normality assumption appropriate here? This seems like an important check since thresholds are chosen using the fact that the Mahalanobis distance follows a chi-square distribution.
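
    For reference, the standard result the question probes (not specific to this paper): if the vector of per-organ uncertainty scores is multivariate Gaussian with a shared covariance, the squared Mahalanobis distance follows a chi-square distribution with as many degrees of freedom as score dimensions, which is what makes a chi-square quantile a valid cutoff:

    ```latex
    % Holds exactly with population parameters and approximately with estimates
    % from a sufficiently large training set.
    u \sim \mathcal{N}(\mu, \Sigma)
    \;\Longrightarrow\;
    D_M^2(u) = (u - \mu)^\top \Sigma^{-1} (u - \mu) \sim \chi^2_d,
    \qquad
    \tau = F_{\chi^2_d}^{-1}(1 - \alpha)
    ```

    If the scores are not approximately Gaussian, the chi-square quantile no longer has the intended coverage — a point the authors address in their rebuttal with a hypothesis test.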

    How much more expensive (i.e., in terms of computational time) is it to obtain uncertainties here with a deep ensemble using MC dropout?

    Is there a benefit to considering a wider variety of base learners, as opposed to just a similar deep model for each one? This seems like it might better get at understanding the epistemic uncertainty.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper addresses a critical gap in the existing literature (especially applied to radiotherapy planning), is very well-organized and justified, and the authors give great examples on which their method is applied.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Major Points (MP)

[MP 1: Comparative Evaluation] All three reviewers highlight the lack of comparative evaluation (CE) with other methods as a potential weak point in our paper. The strongest concern is raised by R3, questioning whether a paper without such evaluation is suitable for MICCAI at all.

We agree that CE is an important scientific tool and acknowledge that it is common practice in MICCAI papers. However, the primary goal of our paper is to showcase the application and demonstrate that epistemic uncertainty estimation can yield quantifiable improvements in an FDA-approved and widely used pipeline for OAR segmentation. This addresses a relevant gap between research and application, as discussed by the cited review paper. For this goal, a CE is not necessary and would be a distraction from our primary objective. As an industrial player, we see our primary role in communicating findings in a product environment. We hope that the academic community can build upon our work by performing comparative analyses on public datasets.

Regarding the impact, we believe our work can advance the scientific discussion if academic researchers utilize our findings to underscore the clinical importance of further research in uncertainty quantification. We further hope to inspire the curation of similar benchmark tasks based on public data, which can be used for CEs in future studies.

To address the reviewers’ concerns, we propose the following actions for the camera-ready version of our paper: 1) Expand the future work section to encourage academic players to perform CEs and design similar tasks based on public data. 2) Refine the introduction and abstract to clarify our scope and intended message.

We believe that there is a space at MICCAI for application studies like ours, which focus on demonstrating practical and clinical relevance.

[MP 2: MD Methods] R3 further criticises our claim that we improve upon other MD-based methods. This is a misunderstanding. We would like to clarify that we do not intend to claim superiority over other methods. As highlighted in MP 1, our main message is to demonstrate that epistemic uncertainty estimation can be beneficial in a real-world application. We also want to point out that we do not make such a claim in the abstract or introduction. However, we acknowledge that Section 2.3 discusses potential advantages of our MD-based statistical analysis. We believe these advantages (no risk of feature collapse, no need for architectural changes) are valid points. To address this concern, we propose to tone down the language in Section 2.3. This adjustment will help prevent misunderstandings in the future and sharpen our main message. Thank you for the feedback.

[Conclusion] We appreciate that all reviewers highlight the clarity of the manuscript and agree on the importance and clinical relevance of the topic. We hope that the clarifications provided above will offer further insight into the intended focus and decision-making regarding the scope of our work.

Major Questions

@R1 regarding Raw vs Processed: Post-processing is applied to the uncertainty outputs as detailed in Sec. 2.2 (Inference and Uncertainty Quantification). ‘Raw’ refers to the initial output from the uncertainty model, while ‘Processed’ denotes the output after post-processing.
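
One plausible form of such post-processing (an illustrative assumption consistent with R3's note that boundary uncertainties were filtered out, not a reproduction of Sec. 2.2) is to suppress uncertainty in a thin band around predicted organ boundaries, so that residual high uncertainty flags genuine anomalies rather than ordinary boundary ambiguity:

```python
# Illustrative assumption of a boundary-filtering step, not the paper's Sec. 2.2.
# `uncertainty` is a voxel-wise map, `pred_mask` a boolean prediction for one organ.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def filter_boundary_uncertainty(uncertainty, pred_mask, band_voxels=2):
    # Thin band straddling the predicted contour.
    band = binary_dilation(pred_mask, iterations=band_voxels) & \
           ~binary_erosion(pred_mask, iterations=band_voxels)
    processed = uncertainty.copy()
    processed[band] = 0.0  # suppress expected boundary ambiguity
    return processed
```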

@R3 regarding AUROC calculation: We confirm that no training images were used in the AUROC computation. It involves only test ID (control) vs OOD images, ensuring balanced classes.
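
A minimal sketch of that computation (hypothetical per-case score arrays, not the authors' code):

```python
# 20 in-distribution control cases vs. 27 OOD cases; `control_scores` and
# `ood_scores` are hypothetical 1-D arrays of per-case OOD scores.
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.concatenate([np.zeros(len(control_scores)), np.ones(len(ood_scores))])
scores = np.concatenate([control_scores, ood_scores])
auroc = roc_auc_score(labels, scores)
```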

@R6 and @R3 regarding statistical assumptions for the Mahalanobis distance: We have confirmed with a hypothesis test (Hotelling T^2) that the uncertainty score follows a multivariate Gaussian distribution, parameterized with the corresponding mean and covariance. The shared covariance is intuitively justified by the fact that all organs are predicted by the same model, resulting in correlated outputs. We will mention the hypothesis test in the paper.
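
Independent of the Hotelling test mentioned above, a common complementary diagnostic for the Gaussian assumption is to compare sorted squared Mahalanobis distances of the training scores against chi-square quantiles; a sketch under the same hypothetical score arrays as before:

```python
# Complementary diagnostic (not the authors' Hotelling T^2 test). `train_scores`
# is a hypothetical (n_cases, n_organs) array of uncertainty summaries.
import numpy as np
from scipy.stats import chi2

mu = train_scores.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_scores, rowvar=False))
diff = train_scores - mu
d2 = np.sort(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
quantiles = (np.arange(1, len(d2) + 1) - 0.5) / len(d2)
theoretical = chi2.ppf(quantiles, df=train_scores.shape[1])
# Points (theoretical, d2) should lie close to the identity line if the
# Gaussian assumption behind the chi-square threshold is reasonable.
```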




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper should be submitted to the “Clinical Translation of Methodology” track. Although I have some reservations about the practicality of MCD and ensemble methods in clinical workflows, I will endorse the paper, as all reviewers have expressed their approval.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two out of the three original reviewers already recommended acceptance. The only one recommending rejection (2) increased their score to Accept (5) after realizing that the paper fitted the Clinical Application category well (thus I am recommending the Clinical Translation of Methodology track for this one). Some of the comments provided by reviewer 3 seem very meaningful though; I would like to encourage the authors to incorporate them as much as possible in the camera-ready version of the paper. Congratulations!



