Abstract
Annotation variability remains a substantial challenge in medical image segmentation, stemming from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods that produce a single deterministic segmentation prediction often fail to capture these annotator biases. Although recent studies have explored multi-rater segmentation, existing methods typically focus on a single perspective, either generating a probabilistic "gold standard" consensus or preserving expert-specific preferences, and thus struggle to provide a unified, omni view.
In this study, we propose DiffOSeg, a two-stage diffusion-based framework that simultaneously achieves both consensus-driven (combining all experts' opinions) and preference-driven (reflecting each expert's individual assessment) segmentation.
Stage I establishes population consensus through a probabilistic consensus strategy, while Stage II captures expert-specific preferences via adaptive prompts.
On two public datasets (LIDC-IDRI and NPC-170), our model outperforms existing state-of-the-art methods across all evaluated metrics. Source code is available at https://github.com/string-ellipses/DiffOSeg.
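The reviews below repeatedly describe DiffOSeg as building on categorical diffusion probabilistic models. As orientation, here is a minimal sketch of one common formulation of categorical diffusion for binary masks, where the forward process replaces pixel labels with uniform noise and the denoiser is trained with cross-entropy; the schedule handling, the `model(image, noisy, t)` signature, and all names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def categorical_forward_noise(mask: torch.Tensor, beta_bar_t: float) -> torch.Tensor:
    """Categorical forward process for binary masks: with probability beta_bar_t,
    each pixel's label is resampled uniformly from {0, 1}."""
    uniform = torch.randint_like(mask, low=0, high=2)
    keep = (torch.rand(mask.shape, device=mask.device) >= beta_bar_t).long()
    return keep * mask + (1 - keep) * uniform

def diffusion_step_loss(model, image, mask, t, beta_bar):
    """One training step: the denoiser predicts the clean mask from its noised version."""
    noisy = categorical_forward_noise(mask, beta_bar[t])
    logits = model(image, noisy, t)        # assumed signature -> (B, 2, H, W) logits
    return F.cross_entropy(logits, mask)   # mask: (B, H, W) with labels in {0, 1}
```

Under the abstract's description, Stage I would train such a denoiser toward population consensus, and Stage II would add expert-specific conditioning on top of it.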
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5223_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/string-ellipses/DiffOSeg
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ZhaHan_DiffOSeg_MICCAI2025,
author = { Zhang, Han and Luo, Xiangde and Chen, Yong and Li, Kang},
title = { { DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15972},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents DiffOSeg, a diffusion-based framework for medical image segmentation in the presence of multiple annotators. Experimental evaluations are conducted on the LIDC-IDRI and NPC-170 datasets, comparing DiffOSeg with several existing methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is clearly structured and generally well-written.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Overstated Motivation: The paper claims that prior work typically focuses on either generating consensus segmentations or preserving rater-specific annotations, thereby failing to provide a comprehensive “omni-view.” However, recent studies such as PADL [1] and TAB [2] already aim to unify these perspectives by jointly modeling both consensus and expert-specific segmentations. Notably, TAB is also included in the experiments. Therefore, the motivation as stated appears overstated and insufficiently supported by a thorough literature review.
- Limited Technical Novelty: Several key ideas in DiffOSeg appear similar to those in prior work: (1) The use of an importance vector to generate consensus segmentation resembles the expertness vector concept introduced in MR-Net [3]. (2) The prompting-based mechanism for preference-driven segmentation shares similarities with the learnable preference query in TAB [2]. Without a clearer differentiation or technical advancement beyond these methods, the novelty of DiffOSeg remains limited.
- Potential Misuse of Dataset: The LIDC-IDRI dataset may be unsuitable for evaluating expert-specific segmentation performance. To assess rater-specific outputs, the dataset must provide reliable correspondence between each segmentation mask and its annotator. However, LIDC-IDRI does not include consistent rater identities across samples, making it unclear how annotator-specific evaluations were conducted. For instance, “Annotator A1” for Sample 1 may not be the same person as “Annotator A1” for Sample 2. This undermines the reliability of the reported expert-specific results.
- Use of Inappropriate Evaluation Metric: The GED is designed to measure how well the predicted distribution matches the ground-truth label distribution. It is not an appropriate metric for evaluating the accuracy of consensus segmentation results; the choice of evaluation metric should align with the evaluation objective. (A sketch of how GED is typically computed follows the references below.)
[1] Liao, et al. "Modeling annotator preference and stochastic annotation error for medical image segmentation." Medical Image Analysis 92 (2024): 103028.
[2] Liao, et al. "Transformer-based annotation bias-aware medical image segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.
[3] Ji, et al. "Learning calibrated medical image segmentation via multi-rater agreement modeling." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
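To make the metric point concrete, here is a minimal sketch of how GED is commonly computed between a set of sampled predictions and a set of expert annotations, using 1 - IoU as the distance; the function names and the NumPy formulation are illustrative assumptions, not code from the paper.

```python
import numpy as np

def iou_distance(a: np.ndarray, b: np.ndarray) -> float:
    """d(a, b) = 1 - IoU for two binary masks; defined as 0 when both are empty."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    inter = np.logical_and(a, b).sum()
    return 1.0 - inter / union

def generalized_energy_distance(preds, labels) -> float:
    """GED^2 = 2*E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')],
    with S drawn from model samples and Y from expert annotations."""
    d_cross = np.mean([iou_distance(s, y) for s in preds for y in labels])
    d_pred = np.mean([iou_distance(s, t) for s in preds for t in preds])
    d_label = np.mean([iou_distance(y, z) for y in labels for z in labels])
    return 2.0 * d_cross - d_pred - d_label
```

A low GED rewards matching the spread of the annotations as much as their location, which is why the reviewer argues it measures distributional fit rather than consensus accuracy.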
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overclaimed motivation, limited technical novelty, inappropriate dataset, and inappropriate metric.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed the major concerns, so I recommend accepting the paper.
Review #2
- Please describe the contribution of the paper
In medical image segmentation, the core challenges stem from target boundary ambiguity and expert annotation variability (originating from inherent image indistinctness and divergent clinical preferences). Existing methodologies fall into three paradigms: meta-segmentation generates pseudo-gold standards through annotation fusion yet faces theoretical controversies; diversified segmentation learns multi-expert consensus distributions while neglecting individual styles; and personalized segmentation captures expert-specific preferences but sacrifices population consensus. Conventional approaches thus fail to unify consensus-driven and preference-guided segmentation demands, while methods like D-Persona encounter performance limitations due to restricted latent-space expressiveness and inadequate modeling of expert correlations.
This paper proposes DiffOSeg, a two-stage collaborative framework based on categorical diffusion probabilistic models: 1) the probabilistic consensus stage integrates the distributional diversity of multi-expert annotations through implicit population-consensus modeling to enhance segmentation generalizability; 2) the learnable preference-prompting stage employs plug-in prompt modules to encode expert-specific styles, achieving personalized segmentation via dynamic denoising guidance. With 13.4 million parameters, the framework is computationally efficient (a 54% reduction compared to D-Persona) while achieving state-of-the-art performance on both consensus-driven and preference-guided tasks on the LIDC-IDRI and NPC-170 datasets.
Main contributions include: establishing the first diffusion-model-driven unified paradigm for consensus-preference segmentation; proposing a probabilistic consensus strategy and an adaptive prompting mechanism; and validating advantages in anatomical consistency and computational efficiency across cross-dataset scenarios.
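The "probabilistic consensus strategy" summarized above can be pictured with a short sketch: rather than committing to one fixed fused gold standard, each training step samples a random importance vector over the experts and fuses their masks accordingly. The Dirichlet weighting, the 0.5 threshold, and all names here are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def sample_consensus_target(expert_masks: torch.Tensor) -> torch.Tensor:
    """expert_masks: (K, H, W) binary masks from K annotators.
    Draw random importance weights over the experts and return a binarized,
    randomly weighted fusion as this training step's consensus target."""
    k = expert_masks.shape[0]
    weights = torch.distributions.Dirichlet(torch.ones(k)).sample()  # simplex sample
    fused = torch.einsum("k,khw->hw", weights, expert_masks.float())
    return (fused >= 0.5).float()  # a different plausible consensus each step
```

Training against such stochastic targets exposes the denoiser to the spread of plausible consensus masks instead of a single point estimate, which is the "implicit population consensus modeling" the summary refers to.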
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Annotation errors and annotation style learning have long constituted domain-specific challenges in MIA, representing critical pathways toward expert-infused learning. Research endeavors in this direction hold substantial scientific merit and warrant prioritized investigation as a pivotal future research direction.
- The proposed DiffOSeg demonstrates the exceptional capability of DDPMs in excavating annotation errors and stylistic variations (technically considered "erroneous" annotations from a model-optimization perspective). This training paradigm effectively balances multiple flawed annotations to achieve consensus, thereby enhancing model performance.
- From a clinical translation perspective, personalized expert prediction and annotation style assessment represent a highly promising direction, as such stylistic variations are not clinically erroneous but rather demand precise characterization of individual annotation patterns. The prompt-based module in this work addresses this challenge through implicit style characterization, aiming to faithfully preserve each expert's unique annotation signature.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The comparative experiments exhibit incompleteness, particularly in the consensus-driven segmentation phase, where critical comparisons are lacking against noise-tolerant label learning methodologies (DOI:10.1016/j.media.2024.103166) and generative model-based approaches (arXiv:2301.11798). This omission substantially undermines the methodological confidence and validation rigor of the presented work.
- Insufficient interpretation of annotation styles persists, especially for the NPC-170 dataset. From a clinical translation perspective, divergent annotation patterns often encapsulate critical diagnostic implications, notably distinguishing borderline lesions from reactive changes, which inherently carry substantial clinical significance. Systematic characterization and explicit modeling of these stylistic discrepancies would enhance clinical interpretability and amplify the translational value of the proposed framework in real-world diagnostic decision-making.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The study addresses a novel problem with rationally designed experiments; while the comparative analysis remains incomplete, the empirical results moderately demonstrate the performance superiority of the proposed method.
- The resolved challenges exhibit profound clinical translational potential, particularly in bridging the gap between computational modeling and clinician-centric diagnostic workflows.
- The methodological innovation appears limited in scope, though the framework effectively demonstrates the representational capacity of the DDPM training paradigm for consensus-preference disentanglement.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
DiffOSeg is a diffusion-based segmentation framework that addresses inter-expert annotation variability by producing both a population consensus mask and expert-specific masks within a single model. It combines a probabilistic mechanism for fusing multiple annotators’ masks with a prompt-based conditioning scheme that reproduces individual annotator styles. Evaluations on multi-expert datasets indicate that DiffOSeg outperforms several existing multi-rater segmentation baselines across the reported metrics, suggesting improved handling of consensus and personalized outputs in one framework.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper tackles a known gap in multi-annotator segmentation by providing a single framework that addresses both common ground and individual differences among experts. Previous works typically handled either consensus (e.g., creating a single fused mask) or personalized outputs for each expert, but not both.
- The use of a diffusion probabilistic model for segmentation is well-motivated and represents a state-of-the-art technique for modelling complex output distributions. By adopting a categorical diffusion approach, the method can learn the full distribution of plausible segmentations rather than predicting a single deterministic mask. This is particularly powerful for modelling inter-expert variability. The first stage introduces randomness in combining annotations instead of assuming a fixed gold standard, producing diverse yet reasonable consensus segmentation variants.
- Instead of using separate expert-specific output heads or entirely separate models, DiffOSeg introduces a plug-in prompt block to modulate the model for each expert. The prompts are learned and encode each expert's style, which is far more parameter-efficient than separate output heads or separate models (see the sketch after this list for the general idea).
- The experimental results are a major strength. DiffOSeg achieves state-of-the-art segmentation accuracy on both datasets for both consensus and individualized outputs. The paper includes qualitative results (e.g. examples of multiple sampled segmentations and uncertainty maps), which help illustrate that the method captures the variability in ambiguous regions while maintaining accuracy. The ablation study (probabilistic consensus vs. simpler pooling, prompt-based vs. simpler alternatives) further strengthens the paper.
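For readers unfamiliar with prompt-based conditioning, here is a minimal sketch of the general mechanism this review describes: a small table of learned per-expert embeddings modulates the denoiser's feature maps. The module structure, names, and the additive injection are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ExpertPromptBlock(nn.Module):
    """Plug-in block: one learnable prompt per expert modulates a feature map."""
    def __init__(self, num_experts: int, channels: int, prompt_dim: int = 128):
        super().__init__()
        self.prompts = nn.Embedding(num_experts, prompt_dim)  # learned expert styles
        self.project = nn.Linear(prompt_dim, channels)        # map style to features

    def forward(self, feats: torch.Tensor, expert_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); expert_id: (B,) index of the annotator to imitate
        style = self.project(self.prompts(expert_id))         # (B, C)
        return feats + style[:, :, None, None]                # broadcast over H, W
```

Because only the embedding table and a projection are expert-specific, adding an annotator costs on the order of hundreds of parameters rather than a full decoder head, which is the parameter-efficiency argument made above.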
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The proposed solution is complex, involving a two-stage training procedure and a diffusion model with many sampling iterations. Training two stages (first for consensus, then fine-tuning for personalization) further increases the compute requirements. In the clinic, given these requirements, it is unclear to what extent the benefits of the proposed method would justify its cost relative to much simpler deterministic baselines (e.g., nnU-Net). The authors might consider exploring distillation in future work.
- While the paper leverages cutting-edge techniques, some components are incremental advances rather than entirely new. Indeed, the paper could do a better job highlighting which aspects are truly novel versus borrowed. For instance, how does their approach fundamentally differ from D-Persona?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors successfully combine ideas from probabilistic modeling and conditional segmentation to tackle a challenging problem of practical importance (handling annotation variability). The proposed DiffOSeg framework fills a gap in the literature by providing a single model that can output either a consensus segmentation or individual expert segmentation as needed. The methodology is backed by clever engineering (the consensus weighting scheme and prompt-based conditioning). Experimental results are convincing, with DiffOSeg consistently outperforming strong baselines on multiple metrics, and gains in segmentation diversity modeling (much lower GED) and accuracy (higher Dice) demonstrate real value.
While some components are incremental or borrowed, these concerns do not outweigh the strengths of the method. The evaluation is also quite comprehensive, including ablations and appropriate baselines, which adds confidence to the results.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank the reviewers for their insightful feedback and are encouraged by their recognition of our: 1) innovative diffusion-based unified framework for consensus-preference segmentation (R2, R4); 2) clinically valuable problem formulation with strong translational potential (R2, R4); 3) rigorous experimental design and validation (R2, R4). The major concerns are addressed as follows:
Q1. Experimental settings (R3, R4). 1) R3 (dataset & metric): As stated in Section 3.1, "We simulated expert preferences on the four provided annotation regions following the setup in [24,29]"; following prior work, we simulated expert preferences by manually ranking the four annotations by area. Section 3.4 specifies that "We employ two complementary metrics: the Generalized Energy Distance (GED) and Threshold-Aware Dice (Dice_soft) [24], which are used to assess the diversity and fidelity of segmentation results, respectively", where GED primarily quantifies the internal diversity (variability) of segmentation outputs while Dice_soft evaluates accuracy. 2) R4 (comparisons): We focus on probabilistic consensus segmentation methods (Prob. UNet/PhiSeg/D-Persona) aligned with Stage I's objective, thus excluding noise-robust methods that target label cleaning rather than distribution modeling. We appreciate the suggestion to include additional generative-model comparisons, which would strengthen the study; due to rebuttal constraints, this comparison will be added in the revised version.
Q2. Method (R3, R2). 1) R3 (motivation): We respect the contributions of PADL and TAB; however, DiffOSeg offers a fundamental difference: its probabilistic consensus differs from PADL/TAB's deterministic consensus. DiffOSeg generates a probability distribution covering the multi-expert consensus, extending the probabilistic consensus hypothesis of Prob. UNet. PADL employs a deterministic mapping \(\mu = F_{\mu}(f_{img}; \theta_{\mu})\) and outputs \(\mu\) as the meta-segmentation map, which is essentially a point estimate and does not capture distributional diversity; sampling is used only during training (to optimize the networks via the reparameterization trick) and inference (to predict \(\mu\), discarding the variance). The "omni-view" denotes our unified paradigm integrating probabilistic consensus and personalized modeling, benchmarked against the most typical works in the field. We sincerely appreciate your observation and will revise the expression "a gold standard consensus" to "a probabilistic gold standard consensus" for precision. 2) R3 (technical novelty): Our major innovation is the first diffusion-based framework unifying consensus and preference modeling, achieving both strong performance and parameter efficiency. The probabilistic consensus strategy is a concise yet effective approach to enhance expert-distribution modeling, validated through ablation studies. For personalized segmentation, DiffOSeg employs data-driven implicit learning: 128-channel dynamic prompt inputs are adaptively compressed to 4 expert-specific channels during training; in contrast, TAB employs fully explicit learning with fixed-dimension parallel queries matching the expert count. We will clarify this further in the revision. 3) R2: Thank you for the positive assessment and suggestions. Our work is inspired by D-Persona and aims to explore a more effective paradigm (e.g., upgrading from the VAE framework to DPMs) for modeling the multi-expert annotation distribution, while streamlining its redundant multi-expert-head design. Experimental results demonstrate significant performance gains with parameter efficiency.
Q3. Clinical explanation (R4). We are grateful for this constructive suggestion, which meaningfully expands our understanding. Our current approach models styles implicitly via dynamic prompts, and we will prioritize improved characterization and utilization of annotator variability in future work.
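The rebuttal's Q2.2 mentions 128-channel dynamic prompts being adaptively compressed to 4 expert-specific channels. As a rough, assumed illustration of what such a compression could look like (a learned 1x1 convolution here; none of this is the authors' code):

```python
import torch
import torch.nn as nn

class DynamicPromptCompressor(nn.Module):
    """Illustrative only: compress a 128-channel dynamic prompt tensor into a
    small number of expert-specific channels (4 here, one per annotator)."""
    def __init__(self, in_channels: int = 128, num_experts: int = 4):
        super().__init__()
        self.compress = nn.Conv2d(in_channels, num_experts, kernel_size=1)

    def forward(self, prompt: torch.Tensor, expert_id: int) -> torch.Tensor:
        # prompt: (B, 128, H, W) -> expert maps: (B, 4, H, W)
        expert_maps = self.compress(prompt)
        # Select the requested expert's channel as conditioning: (B, 1, H, W)
        return expert_maps[:, expert_id : expert_id + 1]
```

This contrasts with the fixed parallel queries the rebuttal attributes to TAB: here the per-expert channels emerge from a learned compression of a shared prompt tensor.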
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
This work received mixed reviews, with key concerns about its technical novelty and clinical explanations. The authors are invited to provide a rebuttal addressing these comments.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
After reading the rebuttal, I agree with the consistent acceptance recommendation raised by three reviewers.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
DiffOSeg offers a well-motivated, well-engineered, and thoroughly validated approach to a complex and increasingly relevant problem in medical AI. It introduces meaningful innovations both conceptually and practically, and the rebuttal demonstrates thoughtful engagement with reviewer concerns.
Recommendation: Accept