Abstract

High-quality pixel-level annotations of medical images are essential for supervised segmentation tasks, but obtaining such annotations is costly and requires medical expertise. To address this challenge, we propose a novel segmentation framework that relies entirely on coarse annotations, encompassing both target and complementary labels, despite their inherent noise. The framework introduces transition matrices to model the noise in the coarse annotations. By jointly training on multiple sets of annotations, it progressively refines the network's outputs and infers the true segmentation distribution, achieving a robust approximation of precise annotations through matrix-based modeling. To validate the flexibility and effectiveness of the proposed method, we demonstrate results on two public cardiac imaging datasets, ACDC and MSCMRseg, and further evaluate its performance on the UK Biobank dataset. Experimental results indicate that our method surpasses state-of-the-art weakly supervised methods and closely matches fully supervised approaches. Moreover, our method offers a promising pathway toward training large medical segmentation models, e.g., MedSAM \cite{medsam}, with minimal manual labeling effort while maintaining high performance.
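
A minimal sketch of the transition-matrix idea above (illustrative only: the shapes, names, and exact factorization are assumptions, not taken from the paper):

    import torch

    # Illustrative sizes: batch B, classes C, image H x W.
    B, C, H, W = 2, 4, 8, 8

    # Segmentation network output: predicted distribution over the true label per pixel.
    true_probs = torch.softmax(torch.randn(B, C, H, W), dim=1)

    # Annotation network output: a per-pixel C x C transition matrix T, where
    # T[..., i, j] models P(coarse label = j | true label = i).
    T = torch.softmax(torch.randn(B, H, W, C, C), dim=-1)

    # Predicted coarse-annotation distribution, applied pixel-wise.
    p_true = true_probs.permute(0, 2, 3, 1).unsqueeze(-2)        # B, H, W, 1, C
    coarse_probs = (p_true @ T).squeeze(-2).permute(0, 3, 1, 2)  # B, C, H, W

    # Training matches coarse_probs to the observed coarse labels, e.g. with NLL:
    coarse_labels = torch.randint(0, C, (B, H, W))
    loss = torch.nn.functional.nll_loss(torch.log(coarse_probs + 1e-8), coarse_labels)

Because the noise model is absorbed by T, the gradient signal reaching true_probs can favor the clean segmentation rather than a copy of the noisy labels.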

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0555_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{DuAng_RefineSeg_MICCAI2025,
        author = { Du, Anghong and Aung, Nay and Arvanitis, Theodoros N. and Piechnik, Stefan K. and Lima, Joao A. C. and Petersen, Steffen E. and Zhang, Le},
        title = { { RefineSeg: Dual Coarse-to-Fine Learning for Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {464 -- 473}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a new weakly-supervised medical image segmentation framework that can be trained using only coarse annotations; moreover, transition matrices are introduced to model the inaccurate and incomplete regions in the coarse annotations. Extensive evaluations are performed on three public datasets to demonstrate the effectiveness of the proposed method.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper presents a new coarse-to-fine learning strategy for segmenting medical images; the adopted coarse annotations are easier to obtain than pixel-level annotations. Moreover, the authors introduce transition matrices as regularization to model and disentangle the complex mappings from input images to the coarse annotations and to the true segmentation distribution.
    2. The proposed method is evaluated on three public cardiac datasets, and the reported performance is quite good compared to the state-of-the-art methods.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The authors claim that the coarse annotations are obtained by eroding the ground truths from the datasets or by manual annotation; label noise can therefore be introduced in practice, and the performance would depend on the initial coarse annotations. In this case, the robustness of the proposed method cannot be ensured.
    2. The difference between the proposed coarse annotations and other weak annotations such as scribbles is not clear; if the so-called coarse annotations are as coarse as scribbles, the advantage of the proposed method is not convincing.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1. The authors claim that the coarse annotations are obtained by eroding the ground truths from the datasets or by manual annotation; label noise can therefore be introduced in practice, and the performance would depend on the initial coarse annotations. In this case, the robustness of the proposed method cannot be ensured. 2. The difference between the proposed coarse annotations and other weak annotations such as scribbles is not clear; if the so-called coarse annotations are as coarse as scribbles, the advantage of the proposed method is not convincing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns about the annotation have been addressed



Review #2

  • Please describe the contribution of the paper

    The paper introduces a novel weakly supervised segmentation framework that enables end-to-end joint training using both positive (target) and negative (complementary) coarse annotations. Unlike existing weak supervision methods, the proposed approach explicitly models and disentangles the complex relationships between input images, coarse labels, and the underlying true segmentation distribution. This is achieved by incorporating transition matrices as a form of regularization. The framework is evaluated on multiple benchmark datasets, including ACDC, MSCMRseg, and UK Biobank, where it outperforms existing weakly supervised methods by a large margin and achieves performance comparable to fully supervised models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of a transition matrix in place of coarse segmentation labels within the loss formulations (Equations 3 and 4) is an interesting approach that adds a layer of regularization to the training process.

    The proposed method delivers substantial improvements over existing weakly supervised baselines and is comparable with ScribFormer, demonstrating performance that approaches fully and semi-supervised models.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The methodology section is difficult to follow, especially in terms of understanding the overall training pipeline. A clear, end-to-end overview of the training process would help.

    2. In Table 1, the reported results for the ACDC and MSCMRseg datasets appear to have inconsistent or possibly flipped values for the right ventricle (RV) and myocardium (MYO) across the baselines. Clarification on this discrepancy would be helpful.
    3. It is unclear whether the performance improvements reported are statistically significant. Including significance testing or confidence intervals would strengthen the claims.
    4. The model is evaluated using coarse mask annotations, but it is also compared against baselines that rely on other forms of weak annotations such as points, scribbles, or bounding boxes. This raises fairness concerns in the comparison. While I understand the model is designed to go from coarse to fine segmentation, methods like scribbles or bounding boxes could also be adapted as coarse labels. If the authors can justify the comparison setup, it would help address this concern.
    5. It would be valuable to include an analysis of cases where the model performs best and worst. Understanding failure modes could offer insights into the limitations and robustness of the proposed method.
    6. There is a lack of ablation studies to isolate the contributions of key components. For instance, it would be informative to explore how the number of positive and negative coarse annotations affects model convergence and its ability to approximate the true segmentation distribution.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the performance improvement is notable, the concerns I’ve outlined particularly regarding the clarity and evaluation of the proposed method have influenced my decision.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose RefineSeg, a segmentation framework that takes coarse positive and negative annotations (weakly supervised) and applies them to estimate the true segmentation in cardiac images. The 2D UNet-based architecture in Fig. 1 consists of a segmentation network that predicts the true segmentation for the input image, and an annotation network that predicts the transition matrices for the input image. The loss function consists of a positive label loss (CE and Dice for the annotated regions), a negative label loss (with an additional linear transformation layer), and an identity matrix regularization loss for the transition matrices. The ACDC, MSCMRseg and UK Biobank (UKBB) datasets are used for the experiments. To obtain positive/negative coarse annotations, the segmentation masks are eroded following the approach in [3]. To evaluate segmentation performance, Dice scores are reported for RV, LV and MYO. The proposed framework is shown to outperform the weakly supervised approaches and a semi-supervised training approach with partial masks. Qualitative results are also provided for the various methods.
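
    A rough sketch of this three-term objective as summarized above (a hedged paraphrase, not the authors' code; names, shapes, and weights are illustrative, and the negative label loss is left as a stub since its exact form is not given here):

        import torch
        import torch.nn.functional as F

        def positive_label_loss(coarse_probs, labels, annotated):
            # CE + soft Dice, restricted to annotated pixels (annotated: float mask, B x H x W).
            ce = F.nll_loss(torch.log(coarse_probs + 1e-8), labels, reduction="none")
            ce = (ce * annotated).sum() / annotated.sum().clamp(min=1)
            one_hot = F.one_hot(labels, coarse_probs.shape[1]).permute(0, 3, 1, 2).float()
            m = annotated.unsqueeze(1)
            inter = (coarse_probs * one_hot * m).sum()
            dice = 1 - 2 * inter / ((coarse_probs * m).sum() + (one_hot * m).sum() + 1e-8)
            return ce + dice

        def identity_regularization(T):
            # Penalize deviation of each per-pixel transition matrix from the identity.
            eye = torch.eye(T.shape[-1], device=T.device)
            return ((T - eye) ** 2).mean()

        # total = positive_label_loss(...) + w_neg * negative_label_loss(...) \
        #         + w_id * identity_regularization(T)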

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper tackles an important and challenging research problem of using coarse labels to train models, thus aiming to reduce costs related to expensive human labeling. Expert labels for medical images can be very expensive to obtain for large datasets.

    • The method is validated on publicly available datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Typos:

    • “In this section, we describe how to joint learn the true segmentation distribution”
    • “To address this issue, [19] proposed restricts loss computation to only the annotated”
    • “features from different coarse annotations, esulting in a more comprehensive”
    • “represents the predicted truth label distribution”

    • More details of the network architecture should have been provided at the beginning of Section 2. Currently, we have to infer the network outputs from Section 2.2, where the losses are described. For example, it is mentioned later in the paper that a 2D UNet is used, but when looking at Fig. 1, there are many decoder branches. What do the blue and orange branches predict - negative and positive transition matrices, perhaps? What is the value of "P" used in the experiments? Is this number related to the number of branches in the annotation network? Is the annotation network thrown away after training?

    • Ablations would make these claims stronger - “This demonstrates the effectiveness of jointly training with both positive and negative coarse annotations.” “The inclusion of negative coarse annotations further enhances the model’s ability to extract positive features while imposing stronger regularization leading to more robust feature representation”

    • Tab. 1 contains just the Dice scores; no distribution is provided as done in Fig. 4. Why?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The writing in this paper is mostly clear - for example, in explaining the motivation behind the problem - but it can be improved in the technical parts (Section 2). The proposed method addresses an important problem of using coarse labels (which are much cheaper) to train models. Some of the details of the network architectures should be explained in the early, rather than later, parts of the paper. The evaluation is done against other methods on public datasets. The proposed method is shown to outperform all the other weakly supervised methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their constructive feedback. We are pleased that all reviewers (R1, R2, R3) recognize the novelty and value of our proposed weakly supervised segmentation framework, RefineSeg, and its strong performance across two public cardiac datasets and the UK Biobank dataset. Below we address key concerns and clarify misunderstandings where appropriate.

(1) R1 & R3 – Justify the fairness of comparisons with Scribble or Box annotations.

We thank the reviewers for the question. Our coarse annotations provide region-level supervision with approximate shapes, unlike scribbles or boxes that offer sparse or location-only cues. Scribble-based methods often rely on post-processing (e.g., ScribFormer’s random walk, which depends on closed-loop strokes), and box methods use annotations as auxiliary constraints rather than direct supervision. As such, these forms are not directly interchangeable, and our comparisons are fair given their inherent differences. We will clarify this distinction in the final version.

(2) R1 & R3 – Robustness to noisy or variable coarse annotations.

We appreciate the concern that the proposed method might be sensitive to how coarse labels are obtained. Our method is explicitly designed to handle annotation noise without assuming label accuracy. By modelling image- and pixel-dependent transition matrices with identity regularization, the model learns to correct noisy coarse labels rather than memorizing them. This enables it to infer the true segmentation distribution from diverse, imperfect inputs. We validate this robustness on both synthetic and real annotations (Figs. 2 & 3) and will clarify the strategy in the final version.

(3) R2 & R3 – Clarity of method and network architecture.

Thank you for the suggestion. The annotation network shares a 2D U-Net encoder with separate decoder branches for positive (blue) and negative (orange) transition matrices (Fig. 1). These matrices are only used during training; the segmentation network alone is used at inference. “P” indicates the number of annotation strategies (e.g., eroded and complementary). We will clarify these details in Section 2 and update Fig. 1 accordingly.
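
A minimal sketch of one plausible reading of this layout (the class, constructor arguments, and the choice to share a single encoder across all decoders are assumptions for illustration, not the authors' code):

    import torch.nn as nn

    class RefineSegSketch(nn.Module):
        # Shared 2D U-Net-style encoder; one segmentation decoder kept for
        # inference, plus P positive / P negative transition-matrix decoder
        # branches (blue/orange in Fig. 1) used only during training.
        def __init__(self, encoder, make_decoder, P=2):
            super().__init__()
            self.encoder = encoder
            self.seg_decoder = make_decoder()
            self.pos_branches = nn.ModuleList(make_decoder() for _ in range(P))
            self.neg_branches = nn.ModuleList(make_decoder() for _ in range(P))

        def forward(self, x):
            feats = self.encoder(x)
            seg = self.seg_decoder(feats)                  # used at inference
            pos_T = [b(feats) for b in self.pos_branches]  # training only
            neg_T = [b(feats) for b in self.neg_branches]  # training only
            return seg, pos_T, neg_T

At inference, only the encoder and seg_decoder would be evaluated, consistent with the transition-matrix branches being discarded after training.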

(4) R3 – Statistical significance and lack of error/failure analysis.

We agree on the importance of statistical validation. Due to the page limit, we will report standard deviations or 95% confidence intervals and conduct paired t-tests in the extended journal version to support our claims. We will also analyze failure cases on the UKBB dataset, highlighting both challenging and well-performing examples to better understand model limitations and guide future improvements.

(5) R2 & R3 – Need for ablation studies and analysis of annotation contributions.

We agree this is an important component for isolating contributions. To better support our claims, we will include the following ablations in the extended journal version: (i) training with only positive or only negative coarse labels; (ii) varying the number of annotation strategies P; and (iii) evaluating the effect of identity matrix regularization by adjusting its strength.

(6) R3 – Clarification on Table 1: potential flip in RV/MYO results.

Thank you for catching this: values for RV and MYO in some baselines were misaligned due to a table formatting error. We will correct this to ensure accurate comparisons. This does not affect our conclusions, which rely primarily on stronger baselines such as ScribFormer. All reported typographical errors will be fixed.

We hope these clarifications and planned improvements address the reviewers’ concerns. We believe RefineSeg offers a generalizable and practical approach for weakly supervised segmentation in real-world medical imaging scenarios and respectfully ask for reconsideration of our paper for acceptance.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The reviewers pointed out the importance of learning from coarse annotations, and the proposed method has some merits. The authors are invited for a rebuttal due to several concerns: 1) unfair comparison with other methods for scribble learning, as they can also be adapted to training with the same type of annotations as this work; 2) insufficient ablation studies to show the contribution of each component; 3) unclear method description; 4) label noise may be involved when generating the coarse annotations; 5) no significance analysis for the results.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed scores, and still after the rebuttal process is a borderline submission. Despite this, there seems to not be strong reasons to reject it, as most raised concerns (regarding clarifications on the empirical validation) are responded in the rebuttal. I recommend the acceptance of this work, and strongly encourage the authors to consider in the final version the constructive criticism raised during the review process, e.g., explaining the fairness of the empirical comparison (scribbles and bounding box methods). Furthermore, while I agree that adding additional datasets or ablations is not possible for the submission, I also believe that adding statistical values in the camera ready version is possible, and thus encourage the authors to do so.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed scores after the rebuttal. The main weaknesses are the lack of clarity in the methodological presentation and concerns regarding the experimental setup. Besides, the absence of ablation studies and statistical analysis limits the strength of the claims. Upon reviewing the manuscript and the rebuttal, I agree that it is not ready for publication in its current form.


