Abstract

Accurate segmentation of orbital bones in facial computed tomography (CT) images is essential for creating customized implants to reconstruct defective orbital bones, but it is particularly challenging because of ambiguous boundaries and thin structures such as the orbital medial wall and orbital floor. In these ambiguous regions, existing segmentation approaches often output disconnected or under-segmented results. We propose a novel framework that corrects segmentation results by leveraging consensus from multiple diffusion model outputs. Our approach employs a conditional Bernoulli diffusion model trained on diverse annotation patterns per image to generate multiple plausible segmentations, followed by a consensus-driven correction that incorporates position proximity, consensus level similarity, and gradient direction similarity to correct challenging regions. Experimental results demonstrate that our method outperforms existing methods, significantly improving recall in ambiguous regions while preserving the continuity of thin structures. Furthermore, our method automates the manual process of segmentation result correction and can be applied to image-guided surgical planning and surgery.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5352_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{AnJin_From_MICCAI2025,
        author = { An, Jinseo and Lee, Min Jin and Shim, Kyu Won and Hong, Helen},
        title = { { From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a diffusion-based model for image segmentation, applied to segmentation of the orbital medial wall. The reverse path of the model is trained to denoise into a binary mask, and the noise is Bernoulli rather than Gaussian.

    Formulating segmentation as a diffusion process enables sampling of multiple plausible segmentations (learned from 3 different raters). A consensus is then computed from multiple samples, and MRF-like regularization is used to compute the final segmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • significant improvement on segmentation of thin structures where the annotations are often inconsistent or result in several discontinuous regions, and where traditional methods (U-Net, etc) fail
    • elegant incorporation of annotations from multiple raters
    • implicit modelling of uncertainty
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Conditioning on the CT image is unclear, and the architecture of the model is not described in sufficient detail for the paper to be reproducible. Is this based on one of the cited works?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Please provide more details about the architecture, especially the U-Net and the conditioning. How is it applied?

    It would be interesting if the expertise of the raters could somehow be taken into account when training the model.

    Is the generated consensus more aligned with the experienced raters?

    What is the effect of the number of generated segmentations on the consensus? How many are needed for a stable measure?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes an interesting approach to model the segmentation via Bernoulli diffusion. Currently, the paper is difficult to reproduce due to significant details missing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper focuses on developing methods to segment the orbital floor, with the aim of developing technology for surgical planning in orbital reconstruction. The authors point out that this clinical need is challenging because the orbital floor bones are extremely thin and not well resolved by clinical computed tomography imaging. This has led many automated methods to produce segmentations with incorrect topology or holes. The authors develop a technique that models consensus between multiple observers within a Bernoulli noise diffusion model. The application of consensus maps and the Bernoulli noise diffusion model is novel and shows promising results for improving this segmentation task over other state-of-the-art methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors well describe the clinical need of orbital floor reconstruction. This application is well justified. The authors apply their novel methods to a novel data set.

    The framework is novel, combining consensus correction and Bernoulli noise correction.

    The application of this framework to orbital reconstruction is novel.

    The idea of consensus driven correction has many possible applications. This could be a powerful approach to improve performance in many domains.

    The article is well structured.

    The authors compare their results to other methods commonly used for segmentation and show superior performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The authors present limited ablation experiments, which makes interpretation of the results more challenging. What aspects of their algorithm are most important for achieving the improved results? The results imply that the Bernoulli noise diffusion model has nearly the same performance as the proposed model incorporating consensus correction. Why do you think this is? It would be helpful context to have the consensus correction implemented in another network to understand how this alteration affects performance on its own.

    The authors apply their technique on a small dataset. This could limit generalizability.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The authors apply their technique to a custom in-house data set, and the formulation of their algorithm is novel, which makes reproducibility a challenge.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A novel algorithm solving a well-justified clinical need. The authors point to the difficulties and apply well-reasoned strategies to address them. The approaches improve upon the state of the art, and both the Bernoulli noise and the consensus correction appear to improve performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel deep learning-based method for segmenting thin structures in facial CT scans. The primary contribution lies in addressing regions with high uncertainty, areas where even expert annotators show low agreement. To tackle this, the authors propose a diffusion-based approach that leverages the probabilistic nature of diffusion models to capture the variability and uncertainty inherent in the annotation masks. Based on the resulting uncertainty estimates, a post-processing step called consensus-driven correction is applied. This step refines and corrects the segmentation by incorporating neighborhood information, such as gradient similarity and spatial proximity.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method introduces an effective approach for modeling uncertainty in segmentation tasks, successfully highlighting the challenges posed by interobserver variability. The results clearly demonstrate the advantages of diffusion models over traditional CNNs, particularly in scenarios where uncertain regions are critical. Furthermore, compared to other diffusion-based approaches, the proposed method stands out by incorporating the consensus-driven correction strategy, which provides a notable improvement in segmentation accuracy.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The study is limited by a relatively small dataset, with only 71 cases used for training and validation, and just three annotations per case. This raises concerns about the generalizability of the proposed approach. Additionally, in the methodology section, the selection of the theta parameters used in the energy minimization process appears arbitrary. The paper relies on fixed values without providing a systematic method for tuning these parameters, relying instead on empirical trial and error. Lastly, while the results suggest that the consensus-driven correction improves segmentation performance, it remains unclear whether these improvements are truly attributable to this strategy. It is possible that similar gains could be achieved through simpler post-processing techniques, such as removing small isolated regions or connecting disjoint components.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The authors are encouraged to make the code publicly available, particularly for the consensus-driven correction step. Given its potential applicability to generic segmentation masks and associated uncertainty maps, providing access to this component would enable the research community to more easily adopt, validate, and extend the method across a wider range of applications. Furthermore, given the central role of inter-observer variability in this study, it would be valuable to report the ICC (Intra-class Correlation Coefficient) values for the annotation masks. This would serve as a useful reference for future studies adopting a similar approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend this study for acceptance due to its novel perspective on modeling uncertainty in segmentation tasks. It presents a valuable contribution, particularly in addressing inter-observer variability, and offers a promising direction that merits presentation, discussion, and further development within the MICCAI community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their constructive comments. Reviewer #1 acknowledged that our study addresses a well-justified clinical need; Reviewer #3 highlighted both the significant improvement in thin-structure segmentation and the elegant use of multi-rater annotations; and Reviewer #4 commended our effective uncertainty-modeling approach and its consensus-driven correction strategy.

<Reviewer #1, #4> Our consensus-driven correction approach effectively preserves continuity in thin structures, as visually demonstrated in Fig. 3 where our method maintains connectivity in regions where comparison methods show disconnection. This approach significantly improves recall (87.83% in orbital medial wall and 93.24% in orbital floor), which is critical for ambiguous boundaries prone to under-segmentation. We will provide additional experiments applying our correction to other segmentation methods in future work. (Note: “BerDiff” refers to results from averaging 200 diffusion model segmentations, while “Ours” refers to results after applying our correction to these segmentations.) Regarding dataset size, we acknowledge this limitation. Although we have access to 355 facial CT images, obtaining multiple annotations from three annotators following the same annotation protocol presents significant challenges, which limited our study to 71 cases. For future work, we plan to utilize public datasets with similar characteristics to enhance the generalizability of our method.
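
To make the note on "BerDiff" versus "Ours" concrete, the following is a minimal sketch (not the authors' code) of how a consensus map could be formed by averaging N sampled binary masks, and how recall is then measured. The function names, the 0.5 threshold, and the toy arrays are illustrative assumptions; the rebuttal only states that the BerDiff result averages 200 sampled segmentations.

```python
import numpy as np

def consensus_map(samples: np.ndarray) -> np.ndarray:
    """Per-pixel fraction of sampled masks that mark the pixel as foreground.

    samples has shape (N, H, W) with values in {0, 1}.
    """
    return samples.astype(np.float64).mean(axis=0)

def recall(pred: np.ndarray, target: np.ndarray) -> float:
    """Recall = TP / (TP + FN) for binary masks of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fn = np.logical_and(~pred, target).sum()
    return float(tp / (tp + fn)) if (tp + fn) > 0 else 1.0

# Toy example: four sampled masks of a 1x3 image (hypothetical data).
samples = np.array([
    [[1, 1, 0]],
    [[1, 0, 0]],
    [[1, 1, 0]],
    [[0, 1, 1]],
])
target = np.array([[1, 1, 1]], dtype=bool)

consensus = consensus_map(samples)      # values in [0, 1]
averaged_mask = consensus >= 0.5        # assumed threshold for the averaged result
print(consensus, recall(averaged_mask, target))
```

In this toy case the last pixel has low consensus and is lost by plain averaging, which is the kind of under-segmented region a subsequent correction step would aim to recover.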

<Reviewer #3> The implementation of our diffusion model is based on BerDiff [13], and the CT images are concatenated with noisy masks. We will provide a clearer description in the final version. Our dataset includes annotations from one neurosurgeon and two senior medical students. Since the majority of annotations are provided by medical students, our model may be more influenced by their annotation patterns. Regarding segmentation count, we experimented with generating 10 to 200 segmentations and presented the optimal results (200) in our paper, observing that performance generally improves with increasing numbers. In future work, we will investigate the alignment between generated consensus and individual annotators’ patterns, and conduct additional experiments on optimal segmentation counts.
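
As a rough illustration of the conditioning described above, the sketch below noises a binary mask with Bernoulli noise and concatenates it with the CT slice along a channel axis. It assumes the commonly used Bernoulli forward process, in which the noisy mask is sampled with foreground probability alpha_bar * y0 + (1 - alpha_bar) / 2; the function names, array shapes, and the alpha_bar value are hypothetical and are not taken from the paper.

```python
import numpy as np

def bernoulli_forward(y0: np.ndarray, alpha_bar: float, rng: np.random.Generator) -> np.ndarray:
    """Sample a noisy binary mask from a clean mask y0 (values in {0, 1}).

    alpha_bar = 1 keeps the mask unchanged; alpha_bar = 0 gives pure coin flips.
    """
    p = alpha_bar * y0 + (1.0 - alpha_bar) / 2.0
    return (rng.random(y0.shape) < p).astype(np.float32)

def build_network_input(ct: np.ndarray, noisy_mask: np.ndarray) -> np.ndarray:
    """Concatenate a CT slice and a noisy mask (both H x W) into a (2, H, W) input."""
    return np.concatenate([ct[None], noisy_mask[None]], axis=0)

rng = np.random.default_rng(0)
ct = rng.normal(size=(4, 4)).astype(np.float32)   # stand-in for a normalized CT slice
mask = np.zeros((4, 4), dtype=np.float32)
mask[2, :] = 1.0                                   # thin, one-pixel-wide structure

noisy = bernoulli_forward(mask, alpha_bar=0.7, rng=rng)
x = build_network_input(ct, noisy)
print(x.shape)
```

A denoising network would take `x` (and the timestep) as input and predict the clean mask or the noise; that part is omitted here because the page does not describe it.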

<Reviewer #4> For parameter selection, the scaling factors (θ) for the Gaussian kernels were determined through multiple experiments using various values based on the image size, the range of consensus levels, and the range of gradient directions. Concerning ICC values, we have already reported these in the Introduction section of our paper, citing [3], which notes lower consensus in thin bone regions (ICC=0.715 for orbital medial wall, ICC=0.824 for orbital floor) compared to whole orbital bone (ICC=0.931). Our method effectively improves segmentation in thin bone structures by addressing disconnected regions and enhancing recall, and it does not simply remove isolated regions or connect disjoint components like post-processing techniques do. Our approach has two key components working together: First, the diffusion model trained on multiple annotations captures the inherent variability among three annotators, generating diverse plausible segmentations. Second, our correction method integrates position proximity, consensus level similarity, and gradient direction similarity from these multiple generated segmentations to make informed decisions about region connectivity. Through this approach, we preserve the continuity of thin structures in ambiguous regions.
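
The role of the scaling factors (θ) can be illustrated with a small sketch of a pairwise affinity built from three Gaussian kernels, one each for position proximity, consensus level similarity, and gradient direction similarity. This is an assumed form, loosely in the style of dense-CRF pairwise terms; the product of kernels, the argument names, and the example θ values are hypothetical and may differ from the energy actually used in the paper.

```python
import math

def angular_difference(a: float, b: float) -> float:
    """Smallest signed difference between two angles in radians."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi

def pairwise_affinity(pos_i, pos_j, cons_i, cons_j, dir_i, dir_j,
                      theta_pos, theta_cons, theta_dir) -> float:
    """Affinity in (0, 1] between two pixels; larger means more alike.

    pos_*  : pixel coordinates (tuples)
    cons_* : consensus levels in [0, 1]
    dir_*  : gradient directions in radians
    theta_*: scaling factors controlling how fast each kernel decays
    """
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_cons = (cons_i - cons_j) ** 2
    d_dir = angular_difference(dir_i, dir_j) ** 2
    return math.exp(-d_pos / (2.0 * theta_pos ** 2)
                    - d_cons / (2.0 * theta_cons ** 2)
                    - d_dir / (2.0 * theta_dir ** 2))

# Hypothetical values: a neighbor two pixels away with similar consensus and direction.
w = pairwise_affinity((10, 10), (10, 12), 0.40, 0.45, 0.10, 0.15,
                      theta_pos=3.0, theta_cons=0.2, theta_dir=0.5)
print(w)
```

Under this form, each θ sets the scale at which its cue stops mattering, which is consistent with the rebuttal's statement that the values were chosen relative to the image size, the range of consensus levels, and the range of gradient directions.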




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


