Abstract

Accurate segmentation of medical images is challenging due to unclear lesion boundaries and mask variability. We introduce the Segmentation Schrödinger Bridge (SSB), the first application of the Schrödinger Bridge to ambiguous medical image segmentation, modelling joint image-mask dynamics to enhance performance. SSB preserves structural integrity, delineates unclear boundaries without additional guidance, and maintains diversity using a novel loss function. We further propose the Diversity Divergence Index (DDI) to quantify inter-rater variability, capturing both diversity and consensus. SSB achieves state-of-the-art performance on the LIDC-IDRI, COCA, and RACER (in-house) datasets.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4859_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{BarLal_Ambiguous_MICCAI2025,
        author = { Baru, Lalith Bharadwaj and Dadi, Kamalaker and Chakraborti, Tapabrata and Bapi, Raju S.},
        title = { { Ambiguous Medical Image Segmentation Using Diffusion Schrödinger Bridge } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    Introduces a method based on the Schrödinger Bridge for improved segmentation of ambiguous edge regions under label uncertainty, estimating noise in both conditional and unconditional label settings. The proposed method takes an input image and, through a diffusion process, transforms it into a segmentation mask in order to preserve anatomical structure.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Quantitative results show a moderate improvement over established baselines that tackle multi-rater segmentation, while maintaining a lower computational cost due to a lower number of function evaluations compared to the CIMD and CCDM methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The performance of the proposed and comparative methods under an unknown number of output samples is unclear. Based on my understanding of the paper and the comparative methods, varying the number of output masks could result in different performance. It is unclear from the paper whether the number of samples was kept constant for all methods, and whether, at a given constant number of samples, the compute cost / performance is better.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Questions:

    • The Diversity Divergence Index basically measures the Dice score between each prediction and each ground truth to form both expert and generated distributions - as suggested by Rahman (CVPR 2023) - and then measures the difference between each expert and each generated distribution. Combined, this should trend similarly to GED?
    • Figure 2 actually contradicts the statements made by the authors in the text and the caption - "SSB generalizes well, producing diverse yet expert-aligned annotations, unlike other methods that fail to overlap with experts or provide meaningful variability." - but the qualitative images do not show high diversity. For example, for LIDC the ground truth includes an empty (no-segmentation) mask, while SSB does not produce one; why?
    • Why is a larger variability in the prediction desired? If there are multiple “ground truths”, shouldn’t the assumption be that they’re all some noisy version of a single, albeit unknown, ground truth?
    • Since SSB++ takes in the image and slowly transforms it to a segmentation mask, how would this method behave under distributional shift? What about the calibration of the network predictions compared to the baselines?
    • Would the Diversity Divergence Index change with a larger number of output samples? How many samples were generated from the Prob. U-Net / PHiSeg?
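    The DDI-versus-GED question above can be made concrete with a small sketch. This is a hypothetical formulation based only on the reviewer's description (Dice-based distances; the paper's exact definitions of GED and DDI may differ):

    ```python
    import numpy as np

    def dice(a, b):
        """Dice coefficient between two binary masks; 1.0 for two empty masks."""
        inter = np.logical_and(a, b).sum()
        total = a.sum() + b.sum()
        return 1.0 if total == 0 else 2.0 * inter / total

    def ged(experts, samples):
        """Generalized Energy Distance with d = 1 - Dice (GED is often
        defined with 1 - IoU instead). Lower is better; 0 when the
        generated distribution matches the expert distribution."""
        d = lambda a, b: 1.0 - dice(a, b)
        cross = np.mean([d(e, s) for e in experts for s in samples])
        ee = np.mean([d(e1, e2) for e1 in experts for e2 in experts])
        ss = np.mean([d(s1, s2) for s1 in samples for s2 in samples])
        return 2 * cross - ee - ss

    def ddi(experts, samples):
        """Sketch of a Diversity Divergence Index as the reviewer reads it:
        compare the distribution of expert-vs-expert Dice scores against the
        distribution of expert-vs-sample Dice scores (hypothetical)."""
        expert_dist = [dice(e1, e2) for i, e1 in enumerate(experts)
                       for e2 in experts[i + 1:]]
        gen_dist = [dice(e, s) for e in experts for s in samples]
        return abs(np.mean(expert_dist) - np.mean(gen_dist))
    ```

    Under this reading, both metrics are built from the same pairwise Dice scores, which is why one would expect them to trend together.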

    Other comments:

    • Abstract “we introduce segmentation schodinger bridge bridge” bridge is repeated and Schrodinger is mis-spelled
    • Several instances in the abstract and introduction and methods of misspelling of “Schrödinger” (missing r after h)
    • Page 5 - “Equation (??)” missing reference.
    • Page 5: Algorithm 2 “(ref eq. (??))” missing reference
    • Page 5: noise estimation equation – multiple undefined variables.
    • Equation 4: what is gamma?
    • Figure 2c - the label is "home", not RACER? Please be consistent in referring to the Home (RACER) dataset.
    • ADM acronym not defined.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Multiple questions arise from the paper due to a lack of discussion of specific points regarding the number of output samples and the introduced DDI metric.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed the majority of the reviewers’ questions and clarified the contribution of their method from a diffusion model perspective. Specifically, they highlight that their approach differs from existing diffusion-based methods by directly generating the segmentation mask from the raw input image through a diffusion process.

    While the current version of the manuscript suffers from issues of clarity and organization, the authors have acknowledged these shortcomings and mentioned that they will improve the writing in the final revision. Given the methodological contribution and the potential of this work to advance the use of diffusion models in segmentation tasks, I recommend acceptance.



Review #2

  • Please describe the contribution of the paper

    The paper proposes the Segmentation Schrödinger Bridge (SSB) - a framework for applying Diffusion Schrödinger Bridge models to generate multiple segmentations. The Schrödinger Bridge allows the model to learn a map from the distribution of images to the distribution of segmentations. The authors claim that diffusing from images (as opposed to diffusing from Gaussian noise) helps preserve anatomical structure. The authors use the Classifier-Free Guidance based diffusion loss combined with a Dice loss for training. In addition, a new evaluation metric is presented - the Diversity Divergence Index (D_DDI) - to quantify both inter-rater and generated segmentation diversity. The authors report state-of-the-art performance on three datasets (LIDC-IDRI, Stanford COCA, and an in-house RACER dataset) while significantly reducing the computational cost (50 NFEs vs. 1000 in competing methods).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The framework is performant both in accuracy and number of function evaluations
    • A new metric for diversity agreement is presented, clearly explained, and compared alongside standard metrics in the community
    • The authors provide qualitative comparisons of their models to baseline in addition to quantitative
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The results do not have any statistical errors or confidence intervals associated with them, making it difficult to ascertain the consistency of the improvements
    • From my perspective, the performance improvements from SSB++ have to be discounted, as the authors introduced a bias when performing the task-specific optimizations. Had these optimized networks also been used for the baseline frameworks, the comparison would be more fair. Luckily, SSB is performant without the optimizations.
    • At multiple points the authors claim that diffusion from Gaussian noise causes loss of structural integrity. However this claim is never justified theoretically or empirically. A reasonable verification test could be a conditional SSB model that goes from Gaussian noise to masks.
    • There is a general lack of refinement in the writing, e.g. in the algorithms we see "??" placeholders. The use of heavy mathematical notation without accompanying intuitive explanations could be a barrier for readers less familiar with stochastic processes and diffusion-based methods. Occasional typographical errors (e.g., "Schödinger Bridge Bridge" in the abstract) reduce the overall polish.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • The authors need to be consistent in the variables they use. In section 3.1, the data (image) distribution is p_A and mask distribution is p_B. Yet in the sampling algorithm, they start from p_B?
    • The authors do not clearly explain why the $e_{\theta}$ model can output $X_0$ directly. Are they hinting at the use of a DDIM sampler? This was never made explicit.
    • How exactly is the Dice loss computed during training? Are the images thresholded?

    • While the paper makes significant strides in empirical performance, the potential for reproducibility is limited by the absence of complete training and hyperparameter details in the main text. This could be made clear when the code is released. However, the lack of multiple runs does put into question the reproducibility of the results as stated in the paper.
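    On the Dice-loss question above, one common answer is to avoid thresholding entirely and apply a soft Dice loss directly to the predicted probabilities, which keeps the loss differentiable. This is a sketch of the standard soft formulation, not necessarily what the authors implemented:

    ```python
    import numpy as np

    def soft_dice_loss(probs, target, eps=1e-6):
        """Soft Dice loss on predicted probabilities in [0, 1]: no
        thresholding, so gradients flow through `probs`. Both arrays have
        shape (batch, H, W). Standard formulation, not necessarily the
        paper's; eps avoids division by zero for empty masks."""
        dims = (1, 2)
        inter = (probs * target).sum(axis=dims)
        denom = probs.sum(axis=dims) + target.sum(axis=dims)
        dice = (2.0 * inter + eps) / (denom + eps)
        return 1.0 - dice.mean()
    ```

    A perfect prediction gives a loss near 0; thresholding would only be needed at evaluation time, when binary masks are compared.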
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Major revisions required. The core contributions are compelling and could significantly impact ambiguous medical image segmentation. However, clarity issues in the derivations, incomplete descriptions in the experimental section, and gaps in reproducibility must be addressed. The presentation and formatting also require careful proofreading to remove typographical errors and placeholders. Inclusion of a statistical significance analysis would further strengthen the empirical results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a new method for quantifying the uncertainty in medical image segmentation via the Schrödinger bridge. The method is based on the Schrödinger bridge problem, which finds the most likely path between two distributions, in this case the image distribution and the segmentation distribution. The authors also propose a new metric, the "Diversity Divergence Index", to capture the diversity of the segmentation results. The method achieves state-of-the-art performance on the LIDC-IDRI, COCA, and in-house datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written and easy to follow.
    • The formulation of the proposed method is clear, and the algorithm is well-structured.
    • The proposed method is evaluated on multiple datasets and achieves state-of-the-art performance.
    • The proposed metric is simple yet effective in capturing the diversity of segmentation results.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • However, in the context of ambiguous segmentation, the authors aim to preserve the structural information of the image, which conditional diffusion models fail to capture. But it is actually the noisy, ambiguous regions of the image that cause the ambiguity in the segmentation. So, is it contradictory to use the Schrödinger bridge to find the most likely path between two marginal distributions?

    • The Schrödinger bridge aims to find the most likely path between two marginal distributions, which is much more difficult than the conditional diffusion model. How does the training time compare to that of the conditional diffusion model?

    • What does the number "3" in equation (6) mean?

    Minor: Typo in abstract: "Bridge Bridge" should be "Bridge". Missing citation in 26. The reference to an equation in Algorithm 2 is missing ("??").

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper is good, but the motivation for SB hasn't convinced me yet.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

Our thanks to R1, R2, and R3 for their insightful comments and for recognizing the novelty and impact of our method. We have addressed your points individually below and rectified overall typos and grammatical errors in the paper. We hope the revisions described below resolve the issues raised by the reviewers and meet expectations.

[R1] We evaluated the statistical significance of all metric improvements by applying two-sample t-tests across our full test set. Due to space constraints, these were not in the original manuscript, but we present some of them here. For example, we report (mean, std) for the GED metric as follows: CCDM (0.264, 0.003), SSB (0.245, 0.003), and SSB++ (0.208, 0.0025), with all pairwise comparisons against CCDM yielding p << 0.001. Though SSB is generative, its variance is low (GED 0.245 +/- 0.0038 on 3 successive runs), confirming that the bridge's stochasticity captures semantic alternatives without numerical drift.
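The two-sample t-test procedure described above can be reproduced in a few lines. The per-case scores here are synthetic placeholders drawn to match the reported means and stds (not the paper's data), and the sample size of 100 is an assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-case GED scores for two methods (placeholder data only;
# means/stds taken from the rebuttal, n=100 is an assumption). Lower is better.
ged_ccdm = rng.normal(loc=0.264, scale=0.003, size=100)
ged_ssb = rng.normal(loc=0.245, scale=0.003, size=100)

# Two-sample t-test on the per-case metric values.
t_stat, p_value = stats.ttest_ind(ged_ssb, ged_ccdm)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```

With effect sizes this large relative to the spread, the test yields p << 0.001, consistent with the rebuttal's claim.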

[R1,R3] We agree with R1 that SSB++'s performance cannot be discounted, because it performs better even before the optimization. To further clarify R3's comments: SSB is the first method to begin diffusion from the raw image instead of noise. We started with the previously used setting, using the same U-Net architecture, classifier-free guidance schedule, and 16 samples per image (standard for all models, including ours) for a fair comparison; it reduces GED by 15-25% and improves DDI while requiring 20x fewer function evaluations than a 1000-step diffusion. Next, for further optimized performance, we explored the parametric space to assess the stability of the model, without changing the sampling scheme or using extra training data. With only small U-Net tweaks (wider channels, segmentation-focused normalization, and extra iterations), SSB++ sharpens boundaries without growing the model size and remains stable across hyperparameter changes while improving performance.

[R3] It is also robust to distributional shifts, as our home data (RACER) encompasses scans from different scanners and demographic groups, where SSB and SSB++ both surpass the existing literature. Our parameter-exploration experiments reveal that toggling hyperparameters within a +/-20% range changes GED by <=0.0072 and DDI by <=0.0065, indicating strong stability and robustness.

[R2] Here we clarify the motivation for the SB framework and how it addresses limitations of the existing diffusion-based models (CIMD, CCDM) that begin from pure Gaussian noise. These models wash out fine anatomical details and collapse every sample toward a single "best guess," so they cannot reliably quantify boundary ambiguities. They also carry the high computational cost of thousands of steps to recover structure, making them slow and expensive. Our approach reimagines how diffusion can help with unclear boundaries by starting from the actual image instead of pure noise. In the forward pass, we add just enough randomness around uncertain edges to highlight where the model is not sure, but we never wipe out the underlying anatomy. Then, in the reverse pass, we rebuild a crisp mask step by step, at each stage injecting fresh, controlled noise so that each run explores a different but equally plausible segmentation. This "image-to-mask" bridge is not at odds with existing score-based methods; it adds a structural anchor to them, so we retain their flexibility while gaining a clear view of uncertainty. Hence, each time controlled noise is injected, the model traverses a different random path, yielding unique variations in the mask, even though the runs converge to exactly the same marginals.

[R1,R2,R3] Full YAML configuration files (lr, \beta-schedule, CFG weight, Dice-loss coefficient, etc.) are in the repository, along with a one-command script that reproduces every table/figure. We will release the open-source code upon acceptance; it is currently private due to the anonymous review process, and the rebuttal system does not allow external links.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Reviewers acknowledge that the proposed diversity metric is interesting and encourage the authors to further refine the submission in the camera-ready version.


