Abstract

Ossicular chain lesions can cause hearing loss, making accurate segmentation of ossicles critical for clinical diagnosis and treatment. Ultra-high-resolution computed tomography (U-HRCT) provides quality images for ossicle segmentation tasks, but the complex structure of the stapes and variations in annotators’ experience often lead to noisy labels in 3D annotation within clinical practice. To address this, we propose a novel framework tailored for two types of noisy labels: (1) incomplete-structure labels, and (2) complete-structure but inaccurate labels. For the former, we introduce a Dilating&Selecting (D&S) framework, which completes missing structures using a dilating Volumetric Discrete Diffusion Refiner (VDDR) with a novel cover loss and evaluates label completeness via a completeness selection strategy. For the latter, we introduce a noise-based augmentation to better train VDDR. Experimental results demonstrate that the D&S framework reduces the time cost of manual annotation by 90.2%, while VDDR outperforms other state-of-the-art methods. To facilitate further research and development, our code and two datasets are publicly available.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3007_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Flq2002/3dOssSeg/tree/master

Link to the Dataset(s)

N/A

BibTex

@InProceedings{FanLin_Noisy_MICCAI2025,
        author = { Fan, Linqian and Zhang, Mengshi and Wang, Yonghao and Lu, Wenkai and Yin, Hongxia},
        title = { { Noisy Label Refinement Based on Discrete Diffusion Process in 3D Ossicle Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        pages = {423 -- 433}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the challenge of noisy annotations in 3D ossicle segmentation, particularly for ultra-high-resolution CT (U-HRCT) images of the stapes. It proposes a novel Volumetric Discrete Diffusion Refiner (VDDR) for label refinement, which is model-agnostic and operates in a discrete diffusion space. The authors also introduce a Dilating & Selecting (D&S) framework for handling two types of noisy labels: (1) incomplete-structure labels and (2) complete-structure but inaccurate labels. Extensive experiments demonstrate the effectiveness of this framework, showing significant annotation time savings and improved segmentation performance. Additionally, the authors release two new annotated datasets (OSS-I and OSS-C) along with their code.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Innovative use of discrete diffusion in label refinement: VDDR creatively applies discrete diffusion for progressive label correction, an approach that is both novel and potentially generalizable to other noisy label settings in medical imaging.

    2. Time-efficient annotation framework: The D&S framework is designed to reduce manual annotation workload by 90.2%, which demonstrates strong practical value in clinical workflows where precise labeling is time-consuming.

    3. Public data release: The release of OSS-I and OSS-C datasets, including both noisy and refined labels, provides a useful resource to the community for further research on label noise and refinement.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Unclear motivation and inconsistencies in problem setup: The introduction and Figure 1 are confusing and raise several unresolved questions about the design of the study. Specifically:
      • Why do OSS-I and OSS-C follow different annotation guidelines but share the same ground truth concept?
      • Are the two GT examples in Figure 1 derived from the same image (Fig.1c)? If so, why do they appear structurally different?
      • How exactly is the ground truth (GT) defined and validated? If reliable GT is available, why wasn’t it annotated directly from the start?
    2. Ambiguity in dataset construction: The criteria for what constitutes a “noisy” versus “accurate” label are not clearly described, making it difficult to assess the effectiveness of the proposed correction mechanisms.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I gave a Weak Reject because while the proposed VDDR and D&S framework address an important and realistic problem with innovative ideas, the paper suffers from critical clarity issues in its motivation, problem setup, and figure presentation. The inconsistencies in how the datasets and ground truths are defined—especially as illustrated in Figure 1—are confusing and not sufficiently justified. Additionally, if high-quality ground truth labels are indeed available for evaluation, it is unclear why they could not be used for direct model training, undermining the central motivation of the label refinement pipeline. These issues collectively weaken the credibility and reproducibility of the work, despite its promising methodological contributions.

    With improved clarity on dataset design and GT definition, along with stronger comparisons and better articulation of key components, this work could have strong potential in future revisions.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Overall, the paper presents an interesting idea, but the current version suffers from conceptual and presentational clarity issues that I believe would require substantial revision to meet MICCAI standards. Given that major changes are not permitted post-submission, I must recommend rejection at this time.



Review #2

  • Please describe the contribution of the paper
    1. The paper adapts discrete diffusion-based label refinement specifically to 3D medical segmentation, introducing an additional image-encoder pathway. This approach, called VDDR, aims to denoise/refine noisy segmentation masks by modeling them as a discrete state that is gradually “eroded” in the forward process and then “reconstructed” in a reverse diffusion process.
    2. The paper also proposes a specialized two-step pipeline for a real-world scenario: (a) Dilating VDDR: Expands stapes structures in incomplete-structure labels (OSS-I dataset) by training on other labels (OSS-C). (b) Completeness Selection: Automatically identifies the best checkpoint to balance underfitting vs. overfitting, thereby selecting the iteration that yields the most reliable “completion” of missing stapes parts.
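The checkpoint-selection idea in item 2(b) can be illustrated with a small sketch. It is not the authors' procedure: the function name `select_checkpoint`, the per-checkpoint completeness probabilities, and the jump threshold of 0.3 are all hypothetical choices made here for illustration.

```python
from typing import Optional, Sequence


def select_checkpoint(probs: Sequence[float], jump: float = 0.3) -> Optional[int]:
    """Return the index of the first checkpoint whose completeness
    probability rises by at least `jump` over the previous checkpoint.

    `probs[i]` is assumed to be the mean completeness probability that a
    classifier assigns to labels refined by checkpoint i (hypothetical input).
    Returns None if no such sudden increase is observed.
    """
    for i in range(1, len(probs)):
        if probs[i] - probs[i - 1] >= jump:
            return i
    return None


# Illustrative values only (not results from the paper).
history = [0.05, 0.08, 0.10, 0.62, 0.70, 0.71]
print(select_checkpoint(history))
```

Picking the first large jump, rather than the maximum probability, is one plausible way to stop before later checkpoints start to overfit; the real criterion may differ.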
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Ossicle segmentation from ultra-high-resolution CT is a fine-grained task where label noise is pervasive. The proposed method addresses incomplete or erroneous labels and provides evidence that the pipeline can help reduce expert workload.
    2. Building on the discrete diffusion approach, the paper successfully transfers that idea to 3D medical imaging. The authors add an image-encoder module to better extract relevant 3D features, suggesting it outperforms naive 2D or simple channel-concatenation solutions.
    3. Instead of requiring a perfect ground truth for all training data, the approach leverages partial or inaccurate labels from two distinct sets. This is pragmatic for medical imaging, where perfect labels are scarce. The selection mechanism, monitoring completeness probability, is a straightforward but effective idea for checkpoint tuning.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The essential discrete-diffusion-based refiner architecture is heavily inspired by Wang et al. [22]. Most of the novelty here lies in the 3D volumetric adaptation plus an image encoder. While these adaptations are valuable, they are not a fundamental methodological leap in diffusion modeling.
    2. The D&S pipeline trains a “dilating VDDR” on OSS-C, which is said to have complete but potentially inaccurate structure. Its success depends on how error-prone those “complete” masks are. If they are substantially wrong, the method might overfit or incorporate incorrect shapes. The “completeness selection” step partly mitigates this but is still reliant on those partial labels.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a highly relevant clinical labeling problem and demonstrates that the proposed method reduces annotation burden while improving segmentation. Although the discrete-diffusion-based “refiner” is not entirely new, the authors adroitly adapt it to 3D medical segmentation and add a pragmatic pipeline that handles real-world annotation challenges. Hence, the method is well-motivated, thoroughly tested, and offers a practical solution to noisy labeling challenges, but does not constitute a major leap in fundamental methodology.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel framework for refining noisy labels in 3D ossicular chain segmentation from ultra-high-resolution CT (U-HRCT) images. The authors introduce the Volumetric Discrete Diffusion Refiner (VDDR), a conditional diffusion model tailored for label refinement. Two specific types of noisy labels are addressed: (1) incomplete structure labels, which are handled using a Dilating & Selecting (D&S) framework with a novel cover loss; and (2) complete but inaccurate labels, which are augmented with synthetic noise to improve training robustness. The proposed method significantly enhances segmentation accuracy across multiple backbones and datasets, and reduces annotation effort by over 90%.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Clinically targeted problem formulation: The paper addresses a highly specific and clinically important challenge, noisy labels in small-structure ossicle segmentation from ultra-high-resolution CT, which is often overlooked and difficult to annotate accurately.

    2. Well-adapted discrete diffusion model: The proposed VDDR extends discrete diffusion models into the volumetric medical segmentation domain via conditional refinement, making it a suitable application in this context.

    3. Innovative Dilating & Selecting (D&S) strategy: The integration of dilating VDDR with a completeness selection mechanism and a novel cover loss demonstrates originality in tackling incomplete annotations without relying on clean ground truth.

    4. Commitment to reproducibility: The public release of datasets and code significantly enhances the work’s transparency, usability, and value to the community.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. While the method is well-tailored for noisy label refinement, many components (e.g., the discrete diffusion process) are adapted from prior work like SegRefiner. The paper would benefit from clearer distinctions and theoretical justifications for the proposed extensions (e.g., cover loss, dilating strategy).

    2. The evaluation is limited to a custom ossicle dataset, and no experiments are conducted on public benchmarks. This raises concerns about the generalizability and reproducibility of the approach.

    3. Some components, such as the UPS thresholding or augmentation settings, are heuristic and not fully explored via ablation or analysis.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a practical and well-executed framework for refining noisy 3D segmentation labels in ossicle segmentation from ultra-high-resolution CT scans. The proposed Conditional Volumetric Discrete Diffusion Refiner (VDDR), combined with a Dilating & Selecting (D&S) mechanism and a novel cover loss, effectively addresses two common types of noisy labels: incomplete and inaccurate annotations. The framework is tailored to the anatomical and imaging challenges of ossicles and shows strong improvements over multiple segmentation backbones. The method is particularly valuable in scenarios where precise manual annotation is difficult and time-consuming.

    While the diffusion backbone is based on prior work (e.g., SegRefiner), the extensions made—such as volumetric adaptation, task-specific label dilation, and cover loss—are well-motivated and technically sound. Some methodological choices (e.g., thresholds, augmentation strategies) are empirically driven and could benefit from deeper theoretical analysis or ablation. In addition, the evaluation is conducted solely on a custom dataset, without benchmarking on public datasets, which somewhat limits the assessment of generalizability.

    Nonetheless, the paper addresses a clearly defined and clinically relevant problem with a thoughtful and effective solution. The contributions are sufficiently novel and well-validated to merit acceptance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After reviewing the authors’ rebuttal and carefully re-examining the paper, I find that the authors have sufficiently addressed the major concerns raised by all reviewers, particularly those regarding the motivation, dataset setup, and interpretation of label noise.

    Clarification on annotation policies and GT rationale: The authors provided a clear explanation of why OSS-I and OSS-C follow different annotation protocols while still aligning to the same anatomical ground truth. They clarified that ground truth was obtained post hoc through model-assisted correction and expert validation, which justifies the refinement strategy.

    On dataset construction and label accuracy: The authors clarified the definitions of “noisy” and “accurate” labels and addressed the concern about label mismatch or overfitting by emphasizing their use of morphological priors, cover loss, and selective training via completeness probability to mitigate risks of overfitting to incorrect annotations.

    Methodological merit and applicability: While the VDDR model builds on prior work (SegRefiner), the adaptation to volumetric data, use of an image encoder, novel cover loss, and the D&S framework constitute a meaningful contribution tailored to a clinically relevant task. The method is well-validated, reduces manual labeling time significantly (by over 90%), and improves segmentation across multiple backbones.

    Reproducibility and utility: The open release of code and datasets greatly strengthens the community value and transparency of this work.

    Although the paper’s novelty is primarily in adaptation and system integration rather than in proposing fundamentally new algorithms, its practical impact, clear clinical motivation, and strong experimental evidence justify acceptance.




Author Feedback

We thank the reviewers for their valuable feedback. We address the comments below and will revise the camera-ready version accordingly upon acceptance.

R2 had questions about our motivation, problem setup and dataset construction. We address these concerns by responding to the weaknesses in order. For 1(1), we clarify that OSS-I and OSS-C both consist of stapes CT images, and the anatomical structure of the stapes is relatively consistent. Therefore, they share the same ground truth concept. They follow different annotation guidelines because the structure is tiny in 3D views (only 1-2 pixels in some slices) and the signal-to-noise ratio in CT imaging is low. As a result, the annotators need to decide whether or not to segment the uncertain pixels. For 1(2), in Figure 1, the GT in (b) is derived from (c), while the GT in (f) is not. OSS-I and OSS-C are not paired, meaning they come from different CT images. The response to 1(3) also addresses our core motivation. We define GT as the segmentation results that are deemed satisfactory by relevant expert annotators. It is relatively easy for experts to judge whether a result is acceptable, but it is much more difficult and time-consuming for them to manually segment satisfactory results. Our method therefore aims to reduce this labeling burden. The GT labels were obtained from the model predictions, and some were manually corrected by experts at a much lower time cost, as described in Section 3.3 and Section 3.4. This is why we did not annotate them from the start. For 2, an “accurate” label shares the same concept as GT, while “noisy” is the opposite of “accurate”; we use the term “noisy label” to align with the concept in deep learning. We will clarify these points in the Introduction section upon acceptance.

R3 raised two weaknesses: (1) limited methodological novelty and (2) the impact of the “substantially wrong” masks. For (1), while our method builds on Wang [22], we emphasize that (a) to our knowledge, VDDR is the first model-agnostic refiner in 3D medical image segmentation. It flexibly corrects noisy labels via tailored loss functions (e.g., cover loss) and morphological operations. We highlight the application innovation and potential of VDDR. (b) The Dilating & Selecting framework is novel and shown to be effective for OSS-I refinement. In practice, incomplete-structure noisy labels are common because poor CT image quality often leads to partially unsegmented regions in the initial annotation. For (2), in theory, “substantially wrong” masks are rare, as annotations follow structural priors and image-label alignment is strong (e.g., in Fig. 1(d), the red annotation is unlikely to extend much beyond the white bone area). Even in such cases, “dilating VDDR” takes an image and its corresponding noisy label as inputs in the inference stage. It dilates the noisy label as the iteration step grows and does not erode it, because of the cover loss applied in training. Because the VDDR includes an image encoder focused on image understanding, the model does not immediately adapt to inaccurate predictions.
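The "dilate but never erode" behaviour attributed to the cover loss can be pictured with a toy penalty. The paper's actual cover loss is not given in this discussion, so the function below is a hypothetical stand-in: it only measures how much of the input noisy label a prediction fails to cover, with masks represented as sets of voxel coordinates (an assumption made here to keep the sketch dependency-free).

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]


def cover_penalty(pred_prob: Dict[Voxel, float], noisy_label: Set[Voxel]) -> float:
    """Mean foreground probability missing on voxels of the noisy label.

    0.0 means the prediction fully covers the noisy label; 1.0 means it
    covers none of it. Voxels absent from `pred_prob` count as probability 0.
    This is an illustrative penalty, not the loss defined in the paper.
    """
    if not noisy_label:
        return 0.0
    missing = sum(1.0 - pred_prob.get(v, 0.0) for v in noisy_label)
    return missing / len(noisy_label)


noisy = {(0, 0, 0), (0, 0, 1)}
dilated = {(0, 0, 0): 1.0, (0, 0, 1): 1.0, (0, 0, 2): 0.9}  # covers the label
eroded = {(0, 0, 0): 1.0}                                   # drops one voxel
print(cover_penalty(dilated, noisy), cover_penalty(eroded, noisy))
```

A penalty of this shape is indifferent to extra foreground voxels and only punishes removing labelled ones, which matches the one-sided behaviour described above.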

To address R1’s concerns on generalizability and reproducibility, we note: in principle, our method should generalize to CT images. Segmentation masks reflect intrinsic anatomical structures, which are generally robust to distribution shifts. Therefore, new datasets with similar clinical labeling should be compatible with our approach. We did not use public benchmarks because we could not find any that match the characteristics of datasets like OSS-I and OSS-C. We have released the code and will release the dataset upon acceptance to support reproducibility. Also for R1, after adding noise-based augmentation, the Dice increased from 75.31 to 77.14 (Table 2). The thresholding process is detailed at the end of Section 2.3. We emphasize that “a sudden increase” indicates that the classifier is observing sufficiently complete and structured labels, leading to high-confidence prediction.
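For readers less familiar with the metric quoted above, the Dice score can be sketched as follows. This is a generic definition rather than the authors' evaluation code; representing masks as sets of voxel coordinates and reporting the score on a 0-100 scale are assumptions made here.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice_percent(pred: Set[Voxel], gt: Set[Voxel]) -> float:
    """Dice similarity coefficient, in percent, between two voxel sets."""
    if not pred and not gt:
        return 100.0  # convention chosen here: two empty masks agree perfectly
    return 100.0 * 2 * len(pred & gt) / (len(pred) + len(gt))


# Toy masks: the prediction misses one of four ground-truth voxels.
gt = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
print(round(dice_percent(pred, gt), 2))
```

With structures only a few voxels wide, a single missed voxel moves the score noticeably, which is one reason small gains in Dice can still matter for the stapes.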




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed reviews. After reading the paper, reviews and rebuttal, I agree with R1 that this is a solid system paper with high practical utility, particularly for its domain-specific adaptation, strong validation, and real-world applicability. Hence, I recommend acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This work proposes a discrete diffusion-based label refinement method for medical image segmentation. Although the method has some practical value, the reviewers pointed out that it lacks fundamental methodological novelty. In addition, the writing needs to be improved, and the motivation and problem setup should be revised.


