Abstract

Supervised methods for 3D anatomy segmentation demonstrate superior performance but are often limited by the availability of annotated data. This limitation, together with the abundance of unannotated data, has led to growing interest in self-supervised approaches. Slice propagation has recently emerged as a self-supervised approach that leverages slice registration as a self-supervised task to achieve full anatomy segmentation with minimal supervision. This approach significantly reduces the domain expertise, time, and cost associated with building fully annotated datasets required for training segmentation networks. However, this shift toward reduced supervision via deterministic networks raises concerns about the trustworthiness and reliability of predictions, especially when compared with more accurate supervised approaches. To address this concern, we propose integrating calibrated uncertainty quantification (UQ) into slice propagation methods, providing insight into the model’s predictive reliability and confidence levels. Incorporating uncertainty measures enhances user confidence in self-supervised approaches, thereby improving their practical applicability. We conducted experiments on three datasets for 3D abdominal segmentation using five different UQ methods. The results illustrate that incorporating UQ improves not only model trustworthiness but also segmentation accuracy. Furthermore, our analysis reveals various failure modes of slice propagation methods that might not be immediately apparent to end-users. This opens up new research avenues to improve the accuracy and trustworthiness of slice propagation methods.
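
For readers new to the setup, the following is a minimal PyTorch sketch of the idea, written under assumptions rather than taken from the paper's implementation: a registration-style network (`net`, with a hypothetical interface) warps a single annotated slice mask to its neighbours slice by slice, and repeating the propagation with dropout active yields a per-voxel epistemic uncertainty map alongside the mean segmentation.

    import torch

    @torch.no_grad()
    def propagate_with_uncertainty(net, volume, seed_mask, seed_idx, n_samples=10):
        # `net(src_slice, tgt_slice, src_mask)` is assumed to warp src_mask from
        # the source slice to the target slice (a Sli2Vol/Vol2Flow-style
        # registration step); this interface is illustrative only.
        net.train()  # keep dropout active so each pass is one stochastic sample
        depth = volume.shape[0]
        samples = []
        for _ in range(n_samples):
            masks = {seed_idx: seed_mask}
            for z in range(seed_idx + 1, depth):   # propagate toward the last slice
                masks[z] = net(volume[z - 1], volume[z], masks[z - 1])
            for z in range(seed_idx - 1, -1, -1):  # propagate toward the first slice
                masks[z] = net(volume[z + 1], volume[z], masks[z + 1])
            samples.append(torch.stack([masks[z] for z in range(depth)]))
        stack = torch.stack(samples)                # (n_samples, D, H, W)
        return stack.mean(dim=0), stack.var(dim=0)  # mean mask, epistemic variance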

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3515_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3515_supp.zip

Link to the Code Repository

https://github.com/RachaellNihalaani/SlicePropUQ

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Nih_Estimation_MICCAI2024,
        author = { Nihalaani, Rachaell and Kataria, Tushar and Adams, Jadie and Elhabian, Shireen Y.},
        title = { { Estimation and Analysis of Slice Propagation Uncertainty in 3D Anatomy Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes to incorporate epistemic uncertainty quantification (UQ) techniques into two self-supervised slice propagation methods, known in the literature as Sli2Vol and Vol2Flow, to improve the interpretability of 3D anatomy segmentation. For UQ, deep ensembles, batch ensembles, Monte Carlo dropout, and Stochastic Weight Averaging Gaussian (SWAG) were used. The authors utilized publicly available datasets for training and testing.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper demonstrates the use of various uncertainty quantification techniques on two existing self-supervised slice propagation segmentation methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper seems to lack novelty, as it directly uses the above-mentioned UQ techniques and slice propagation methods without proposing any modification.
    • The objective of the paper with respect to the uncertainty quantification task is not clear. The relationship between (variations of) the Dice score and quantified uncertainty has already been explored in prior work (one example: DOI 10.1117/12.2548722).
    • Results suggest concrete dropout could outperform other UQ techniques in the segmentation task, but it is not clear whether this is the message of the paper.
    • Several datasets were used in the study, yet the key identifying differences between them were missing. The effect of dataset shifts between training and testing datasets on the results was not discussed, e.g., training on lymph nodes and testing on liver volumes.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • On page 7, there are two references to non-existent figures.
    • Please elaborate on why 4 ensemble members were chosen for ensembling-based model uncertainty, as existing work typically suggests ensembling 5 networks.
    • It would be interesting to see how the uncertainty varies across different methods for a given anatomy (not just concrete dropout).
    • A figure summarizing the key technical differences between two slice propagation methods (Sli2Vol and Vol2Flow) could be helpful.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although it fills a gap in the literature by applying various uncertainty quantification techniques to slice propagation-based self-supervised segmentation, the paper lacks novelty because it directly applies existing techniques.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present a study on incorporating epistemic uncertainty estimation into two reference architectures for 3D organ segmentation based on slice propagation. They have implemented and tested 5 different approaches for epistemic UQ and conducted an extensive benchmark on 3 public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses the important issue of evaluating the accuracy and uncertainty of sparsely supervised 3D segmentation models based on recent slice propagation methods. The authors have conducted experiments on state-of-the-art slice propagation methods using different popular datasets for 3D organ segmentation. The benchmark includes different relevant metrics, including uncertainty calibration.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A main weakness is insufficient justification for the current study. The authors briefly argue why they disregard aleatoric uncertainty in their study design but fail to provide convincing arguments for this choice. Likewise, the choice of epistemic UQ methods is not motivated. Finally, while the authors address an important issue for practical implementation, the novelty and impact based on the insights gained seem somewhat limited. More discussion on these points is provided in the detailed comments.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The methods are clearly described. The UQ methods and datasets are publicly available and state of the art.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Authors are encouraged to improve their description in the rebuttal phase to address the weaknesses described above.

    • The authors provide little motivation for disregarding aleatoric uncertainty. A more detailed justification of why only epistemic uncertainty is considered would be appreciated. To increase the potential impact of the study, a comparative analysis of the respective benefits of aleatoric and epistemic uncertainty for the task at hand should be provided. As in almost any deep learning-based task, model and data uncertainty are closely linked, and it may be difficult to isolate their effects with any one method.
    • Likewise, the choice of epistemic UQ methods is not motivated well. There are indeed many epistemic UQ methods that have been proposed (see e.g., Abdar et al., https://doi.org/10.1016/j.inffus.2021.05.008). Authors should motivate their choice based on the task at hand.
    • While the authors address an important issue, the study merely combines known state of the art approaches for slice propagation based 3D segmentation and epistemic uncertainty quantification. The novelty and significance of the results is thus somewhat limited. I think it is still interesting to the general community but the impact would be significantly strengthened if authors expanded on their study to include more use cases (either different sparse label approaches for 3D organ segmentation, or other diagnostic imaging use cases).
    • In training and evaluating the models and uncertainty, the authors rely on the labeling quality of the public datasets used. As with any large base dataset, there is a likelihood of annotation errors, which could potentially affect the quantitative results. It would be very insightful if the authors went through the effort of manually checking the predictions, at least for results that are predictive of hypothesized model performance.

    Minor issues:

    • Some references to Figures on page 7 are missing.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The study addresses an important issue with potential impact for the community, even though the impact and novelty are somewhat limited given the narrow use case targeted and the limited original contribution. The authors should improve the motivation of the study as outlined above and potentially address other points from the detailed comments in the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    After reading the rebuttal and the other reviews with essentially similar concerns but different overall opinions, I have come to the conclusion that the presented analysis and results are interesting to the community in particular with respect to the planned “Clinical translation of methodology” session.



Review #3

  • Please describe the contribution of the paper

    The authors propose to study Uncertainty Quantification (UQ) techniques applied to Slice Propagation (SP). Slice propagation is a self-supervised approach for 3D images and is thus of interest to the medical imaging domain. At test time, the system is provided a segmentation of a structure of interest for one slice of a volume and propagates that segmentation to neighbouring slices. Being a self-supervised technique, its performance is subpar in comparison to fully supervised approaches. The authors use this as motivation to study whether (and which) UQ techniques can be used to identify regions of segmentation failure and provide an overall confidence score for the result, ultimately increasing the trustworthiness of the system. This is more of a benchmarking study than a technical contribution.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    [Motivation] Reliability and trustworthiness are pivotal for the success of (semi-)automated solutions in clinical practice. At the same time, the development of robust deep learning systems usually relies on large amounts of labelled data, which are difficult and costly to obtain. This study addresses both of these concerns, making it very relevant. [Evaluation] The study is evaluated on 3 distinct public datasets. The authors evaluate two SP approaches with 5 distinct UQ techniques, making it a very comprehensive study. A good selection of evaluation metrics is also used.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    [Lack of statistical tests] The authors conduct a large number of experiments, but the findings do not feel as meaningful as they could due to the lack of statistical tests; [Lack of technical novelty] There is no technical novelty in this paper. I do believe that this type of study has its merits and is essential for the community, but I am not certain MICCAI is the right conference for this type of material; [Conclusions] The conclusion seems somewhat lacklustre in comparison to the study performed. In my opinion, the authors should have conveyed a clear message about which UQ approach they believe should be used and hinted at research directions to solve the issues they found with the existing state of the art.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I would like to congratulate the authors for the conducted work. Their paper is easy to follow and provides important baselines in the field of study.

    Please find some suggestions which I believe could help to improve your work (no particular order):

    • In Table 1, would it not be possible to compute the R-AUC for the Base (w/o UQ)? For instance, couldn’t the output probability of the system be used as an uncertainty metric? This would increase the meaningfulness of your findings;
    • Again in Table 1, it is not very clear to me how the segmentation performance for approaches with UQ is computed, because both Base (w/o UQ) and UQ methods are presented with standard deviation (which I assume to be computed via sample-wise evaluation). For the UQ methods, how are the segmentations obtained? Are they the average over multiple outputs? Also, what threshold was used to obtain these segmentations, and is the performance dependent on this threshold? (One common fusion convention is sketched after this list.)
    • Following my listed main weaknesses: (1) it would be very meaningful if statistical tests for the reported performances were computed, which would increase the robustness of the findings; (2) I believe your conclusion could be improved to convey a stronger message about your findings;
    • References to supplementary material are not being adequately handled (“??”);
    • The plots in Fig. 1 do not seem correct. In particular: (1) there seems to be a misalignment between the numbers written on the horizontal axis and the plotted data (e.g., the uncertainty-0 blue dot does not match the position of the 0 on the horizontal axis); (2) are the sample images correct? For instance, the sample arrowed to relative position 20 does not seem to have a DSC > 0.9. Likewise, the one arrowed to 30 already has no segmentation, but a DSC is still being reported.
    • Could you rephrase the sentence: “With significant amount (…) generalize.”
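
    To make the threshold question above concrete, here is one common fusion convention (an assumption for illustration, not necessarily the authors’ protocol): average the per-sample probability maps, then binarize at a fixed threshold.

        import torch

        def fuse_samples(prob_samples, threshold=0.5):
            # Average the per-sample probability maps, then binarize; both the
            # averaging and the threshold choice can affect the reported Dice.
            mean_prob = torch.stack(prob_samples).mean(dim=0)
            return (mean_prob > threshold).to(torch.uint8)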

    Thank you

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe this study has its merits, as it seems to have been carefully conducted. However, the lack of technical novelty combined with the absence of strong conclusions has negatively affected my recommendation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After carefully reading the rebuttal for the reviewers’ comments, I believe this paper deserves a spot at the main MICCAI conference. Thus, I have decided to maintain my overall score “Weak Accept”.

    In particular, I think that even though technical novelty is missing, this study is very relevant for the medical image analysis community. Indeed, the contribution of making this sort of benchmarking tool publicly available outweighs the lack of technical novelty. I also think that the authors did a good job clarifying the concerns of the other reviewers.

    My major negative point is that the authors did not address my concern regarding the lack of statistical tests, which I believe would have increased the robustness of their findings.

    Best of luck,




Author Feedback

We thank the reviewers (R) for their insightful feedback. This discussion aims to clarify points of confusion and fill in information missing from the paper. We kindly request that reviewers refer to our supplementary material for a comprehensive comparative analysis of DSC and uncertainty-metric variability across slice propagation methods, datasets, and UQ methods. It also contains GIFs demonstrating uncertainty progression through volumes for different UQ methods (R4). Editorial comments will be addressed in the final version.

NOVELTY and CONCLUSION (R1, R4, R6): Self-supervised slice propagation methods promise to reduce the burden associated with automatic segmentation. However, as our work demonstrates, such approaches are prone to failure and thus cannot be trusted in sensitive clinical scenarios without UQ. Our work exposes and addresses this significant gap. While we do not propose a novel method of epistemic UQ estimation, we provide open-source, non-trivial extensions of existing methods to this new task and benchmark their performance. The resulting contribution is two-fold: we both enhance the safety and usability of slice propagation methods and provide an analysis of the effectiveness of epistemic UQ estimation methods in a new context, which is significantly underexplored. We highlight the strengths and limitations of current approaches, paving the way for future innovations required for slice propagation methods. The guidance on scalable reliability and trustworthiness via UQ is also valuable to the MICCAI community, where safety and real-world applicability are crucial.

CHOICE OF UQ METHODS (R1, R4): The selection of 5 diverse, SOTA, scalable [2] epistemic UQ methods is informed by their varied approaches (covering frequentist and Bayesian perspectives) to addressing model uncertainty effectively. Each method was chosen for its unique strengths: Deep and Batch Ensembles provide robustness through model averaging; MC Dropout and Concrete Dropout facilitate practical uncertainty estimation during training and inference, which is critical for deployment in clinical environments; and SWAG captures variability in model parameters through its approximation of the posterior distribution. This diverse toolkit allows us to comprehensively evaluate and enhance the predictive reliability and interpretability of self-supervised slice propagation methods. We chose to use only four ensemble members based on empirical findings, balancing computational requirements (limited GPU memory) against performance gains.
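
To make the two ensembling-style strategies concrete, the following is a minimal PyTorch sketch (an illustration assuming generic segmentation networks, not the authors’ released code) of how a deep ensemble and MC dropout each turn repeated predictions into a mean segmentation and an epistemic uncertainty map:

    import torch

    @torch.no_grad()
    def deep_ensemble_predict(models, x):
        # Deep ensemble: average softmax outputs of independently trained
        # networks; disagreement across members is the epistemic uncertainty.
        probs = torch.stack([m.eval()(x).softmax(dim=1) for m in models])
        return probs.mean(dim=0), probs.var(dim=0)

    @torch.no_grad()
    def mc_dropout_predict(model, x, n_samples=20):
        # MC dropout: keep dropout active at inference and sample repeatedly
        # from the same network (note: train() also affects BatchNorm layers).
        model.train()
        probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)])
        return probs.mean(dim=0), probs.var(dim=0)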

EPISTEMIC vs. ALEATORIC UNCERTAINTY (R1): Epistemic UQ is the more difficult task, with multiple proposed methods, all of which rely on assumptions or accept limitations to remain scalable, necessitating benchmarking. Aleatoric uncertainty is typically learned as a function of the input data by making the output probabilistic. This approach is well established, so it is less of an open research question. Since this paper focuses on weak/sparse supervision (via single-slice annotation), quantifying the model’s knowledge, given the available supervision, via epistemic UQ is more important both clinically and for the future design of new algorithms. We acknowledge the significance of both uncertainty types in comprehensive diagnostics. Our future aim is to integrate advanced modeling techniques, such as probabilistic deep learning, to effectively address and mitigate aleatoric uncertainty.
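
As a point of reference for the distinction drawn above, this is a minimal sketch of how aleatoric uncertainty is typically learned by making the output probabilistic (a standard Kendall & Gal-style formulation, assumed for illustration rather than taken from the paper): the network predicts a per-voxel mean and log-variance, the learned variance absorbs irreducible data noise, and epistemic uncertainty instead comes from variability across model samples as sketched earlier.

    import torch

    def heteroscedastic_nll(pred_mean, pred_log_var, target):
        # Gaussian negative log-likelihood with a learned, input-dependent
        # variance: high pred_log_var downweights the squared error, letting
        # the network flag noisy voxels (aleatoric uncertainty).
        sq_err = (pred_mean - target) ** 2
        return 0.5 * (torch.exp(-pred_log_var) * sq_err + pred_log_var).mean()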

DATASETS (R1, R4): We selected datasets aligned with those used in the original validations of Sli2Vol and Vol2Flow, ensuring the relevance and comparability of our findings. Sli2Vol and Vol2Flow are domain-agnostic due to the underlying self-supervised registration method (which uses edge profiles similar to MIND features). Our experiments demonstrated minimal differences in model performance due to dataset shifts (organ or domain, CT vs. MRI). However, these methods are sensitive to post-processing techniques, which we will share in our GitHub repo.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While this paper does not significantly contribute to methodological innovation, its application is noteworthy and has the potential to greatly impact manual annotation processes.

    Additionally, it is advisable for the authors to consider submitting to the “clinical translation track” rather than the “methodology track.”




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Accept – the idea of using uncertainty quantification with slice propagation is definitely a worthwhile topic. I fundamentally disagree with the reviewers that only technically novel papers are suitable for MICCAI (the reviewer instructions say they should consider novelty of both methodology AND application); as far as I can tell, the application is fairly novel and fills an important gap in the field. While I disagree with focusing only on epistemic uncertainty, the authors have clear reasoning for why epistemic uncertainty specifically was selected for this study, and they appear to have done a fair survey of different methods.



