Abstract
Semi-supervised learning (SSL) can effectively reduce the labor-intensive labeling required for deep learning based medical image segmentation. The emergence of visual foundation models with zero-shot capability offers a new way of performing SSL. In this paper, a novel SSL framework that combines foundation and dedicated models is proposed. Unlike most existing SSL methods, where the foundation model is manually prompted to generate pseudo-labels from unlabeled images for training the dedicated model in a one-way strategy without further refinement, our framework places the foundation (SAM2) and dedicated (UNet) models in an iterative pipeline. Specifically, in each iteration, prompts are computed from the coarse UNet segmentation results and fed to SAM2 to generate pseudo-labels, which are in turn used to further train the UNet so that it produces better prompts in the next iteration. In this way, the pseudo-labels and the UNet are mutually improved until convergence. To enhance the performance of SAM2 in medical image segmentation, a new uncertainty-aware module using historical cues is presented to optimize key frame selection and prompt generation for SAM2. Furthermore, a new semantic-aware memory bank is introduced, in which the memories in the SAM2 memory bank are divided into semantic groups, allowing anatomical prior knowledge to be leveraged by SAM2. In our experiments, the framework is evaluated on a public and an in-house dataset in the context of multi-label segmentation, and the results demonstrate that it outperforms state-of-the-art SSL methods on both datasets.
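For readers who prefer pseudocode, the iterative pipeline described in the abstract can be summarized as below. This is a minimal sketch only: every callable (train_unet, predict_unet, prompts_from_mask, sam2_segment) is a hypothetical placeholder for the corresponding component, not the authors' code or the SAM2 API, and the number of iterations is an assumption.

# Minimal sketch of the iterative foundation-dedicated loop from the abstract.
# All callables are hypothetical placeholders passed in by the caller.
def iterative_foundation_dedicated(train_unet, predict_unet, prompts_from_mask,
                                   sam2_segment, labeled, unlabeled, n_iters=4):
    train_unet(labeled)                                  # warm-up on the small labeled set
    for _ in range(n_iters):                             # iterate until convergence
        pseudo_pairs = []
        for volume in unlabeled:
            coarse = predict_unet(volume)                # coarse UNet segmentation
            prompts = prompts_from_mask(coarse)          # key frames + prompts for SAM2
            pseudo = sam2_segment(volume, prompts)       # SAM2-generated pseudo-labels
            pseudo_pairs.append((volume, pseudo))
        train_unet(labeled + pseudo_pairs)               # retrain the UNet for better prompts
    return predict_unet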
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0318_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{YinZim_Iterative_MICCAI2025,
author = { Yin, Ziman and Nie, Dong and Li, Shuo and Pan, Junjun and Tang, Zhenyu},
title = { { Iterative Foundation-Dedicated Learning: Optimized Key Frames, Prompts and Memories for Semi-Supervised Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
pages = {257--266}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces an iterative approach for Semi-Supervised Learning, by leveraging the generality of segment-anything models and the specificity of the UNet architecture. The core idea is to train a UNet in the usual supervised way, but generating the ground truth masks automatically through SAM2, thus forgoing the need for costly manual annotations. The uncertainty of the UNet’s predictions is then used to prompt SAM2 to obtain a new iteration of ground truths, thus allowing an iterative refinement of the whole model by exploiting a synergy between the generic SAM2 model and the specific UNet model.
The paper addresses a well-known and valid issue in medical imaging applications, i.e. the lack of large amounts of high-quality annotations and the high cost of their manual generation by clinical experts.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Major strength 1 The authors devise a nifty way to leverage two powerful tools (SAM2 and UNet) by complementing their requirements. The proposed method represents a step forward w.r.t. the individual approaches: UNets can be trained to obtain very good performance on a specific task, but they require high-quality ground truth annotations; SAM2 can segment anything but needs to be prompted. While using foundation models to generate pseudo-masks has already been explored in the literature, the idea of using the uncertainty from the UNet to prompt the “teacher” model in an iterative feedback loop is, to my knowledge, novel.
Major strength 2 The introduction of a semantics-aware module in the memory bank is a second major strength of the paper. In both testing datasets, this module makes a large difference for the bone segmentation task (although it seems slightly less impactful for cartilage segmentation). This may appeal to a personal bias of mine but I am happy to see that the inclusion of domain knowledge in the learning approach yields significant improvements in performance. The idea of grouping memories according to their semantic content is interesting, and deserves to be further explored.
Major strength 3 The method introduces an automatic way to choose key frames based on the uncertainty of the UNet predictions. This idea makes it possible to fully remove the need for human expert interaction, and represents a smart way of leveraging the available information in the system.
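Neither the abstract nor the reviews on this page reproduce the exact key-frame criterion, so the following is a minimal PyTorch sketch of one standard construction: run MC-dropout passes of the UNet, score each slice by its mean predictive entropy, and select the most confident slices as key frames. The model/volume interfaces, the entropy-based score, and the values of n_passes and k are assumptions, not the authors' formulation.

import torch

def mc_dropout_slice_uncertainty(model, volume, n_passes=8):
    # Assumed interface: `model` maps a stack of slices (S, C, H, W) to per-class
    # logits (S, K, H, W). Dropout is kept active at inference time (MC dropout).
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(volume)) for _ in range(n_passes)])
    p = probs.mean(dim=0)                                           # mean prediction per pixel
    ent = -(p * (p + 1e-8).log() + (1 - p) * (1 - p + 1e-8).log())  # per-class binary entropy
    return ent.mean(dim=(1, 2, 3))                                  # mean uncertainty per slice

def select_key_frames(slice_uncertainty, k=3):
    # Pick the k most confident (lowest-uncertainty) slices as key frames for SAM2 prompting.
    return torch.topk(slice_uncertainty, k, largest=False).indices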
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
There are a couple of methodological points that are unclear to me.
- I don’t understand how thresholding the confidence maps can lead to prompts. In particular, in relation to eq. (1), a pixel that was always labeled as background during all MC-dropout iterations should have U=0, C=1. Looking at eq. (4), this pixel would end up having I_{mask} = 1, even though it is background (a worked version of this concern, under an assumed reading of the equations, appears after this list). Wouldn’t this pixel influence the prompt generation in the wrong way?
- In eq. (5) a dilation operation is used. The authors later state that this is justified by the fact that cartilages are thin tissues attached to the bone. It seems like a very ad-hoc approach, and it makes me wonder how well this would generalize to segmenting other structures. If the authors tried the “trivial” definition of F_{cart}, i.e., the one that does not include the dilations and intersection/union operations, I would be interested in knowing what the results were and how they came up with this new formulation instead.
In terms of results, first of all I would like to commend the authors for providing an uncertainty measure in their reporting. However, please include more details: first and foremost, what is the uncertainty metric (standard deviation, I assume?) and secondly, how was it calculated (different random seeds, cross-validation, different choices of the labeled subset, etc.)? A p-value analysis would also be appreciated, since at first glance it does not seem that most of the performance improvements are actually significant. Above all, the major weakness regarding the presentation of results in this paper is the lack of an external validation dataset. The small number of samples in the two datasets and the small number of data sources contribute to this weakness. Overall, we have no indication of how well this method generalizes to different imaging protocols, machines, operators, etc.
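To make the first methodological concern above concrete, a worked version under one assumed reading of eqs. (1) and (4) (whose actual definitions appear only in the paper) is given below: suppose U_p is the variance of the MC-dropout predictions at pixel p, C_p = 1 - U_p, and I_mask is obtained by thresholding the confidence map at tau.

% Assumed reading (not taken from the paper): U_p = variance over MC-dropout passes,
% C_p = 1 - U_p, and I_mask thresholds C at tau.
\[
\hat{y}_p^{(n)} = 0 \ \ \forall n
\;\Longrightarrow\;
U_p = \mathrm{Var}_n\!\big(\hat{y}_p^{(n)}\big) = 0,
\qquad
C_p = 1 - U_p = 1,
\]
\[
I_{\mathrm{mask}}(p) = \mathbb{1}\big[C_p > \tau\big] = 1,
\quad \text{even though } p \text{ is a confidently background pixel.}
\]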
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Some minor comments:
- The fourth sentence in the abstract (beginning with “Unlike most existing …”) seems to cut off abruptly.
- In section 2.1 is there a reason for not including the “channel” dimension in the image dimensions?
- In eq (1), are the Y_{n} logits or do you apply sigmoid/softmax?
- Eq (5), what is F_{I}?
- Eq (5), is a single dilation iteration applied? Or how many? How does the image resolution affect this aspect?
Finally, a very minor comment because I understand that you are presenting a general methodology that could potentially be applied to other situations. Given the small number of images in your two datasets, how long would it take a human expert using SAM2 to manually prompt the image segmentation? It would be interesting to compare to this scenario as well, purely from a practical application point of view.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper has some interesting and impactful strengths, mostly in terms of the nifty ways in which the authors devised a method to “bootstrap” the generation of ground truth segmentation masks, and their inclusion of domain knowledge in the learning approach. However, the lack of a stronger analysis on the generalizability of the method (and therefore its clinical impact/usefulness) and some general doubts on the significance of the improved performance scores slightly reduce the quality of this work.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper introduces an innovative semi-supervised learning (SSL) framework that integrates SAM2 and UNet in an iterative pipeline, enabling mutual refinement of pseudo-labels and segmentation performance. It features an uncertainty-aware module to optimize key frame selection and prompt generation for SAM2 using historical cues, and a semantic-aware memory bank that leverages anatomical prior knowledge by grouping memories semantically. Evaluated on public and in-house datasets for multi-label medical image segmentation, the framework demonstrates state-of-the-art performance, outperforming existing SSL methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The design of this paper fully leverages the characteristics of SAM2 to address 3D image segmentation tasks. The proposed iterative pipeline takes advantage of the generalization capabilities of large models while incorporating domain-specific medical knowledge. This alternating optimization process enhances performance by iteratively refining results through the integration of both general and specialized information.
2. The writing is clear and logically coherent, providing a detailed explanation of the iterative training process and the principles behind each module.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The comparative experiments lack novelty in the methods being compared, as only one study from 2024 is included, while the rest are from 2022 or earlier.
2. The evaluation only includes experiments on two knee datasets, without testing on other 3D medical image datasets to validate the generalization capability of the method.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
None
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
None
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The work proposes an iterative semi-supervised framework for 3D MRI segmentation, incorporating three key components: an uncertainty estimation module, an optimized key-frame selection module, and a semantic-aware memory bank (SA-MB) in the SAM2 architecture.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed framework is innovative in the context of semi-supervised segmentation learning, effectively leveraging an iterative pseudo-labeling process to make use of unlabeled data. Notably, the integration of a foundation model throughout the iterative process combined with a prompting strategy proves to be an effective approach, significantly contributing to the performance of the overall model. This solution is relevant for general segmentation tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is not clear how the iterative process stops. How was the convergence/stopping criterion defined? Are 4 iterations sufficient? Why, and why not more?
- According to the ablation study results, the semantic-aware memory bank (SA-MB) is the most important part of the work. However, it is not clear why it is semantic-aware: the explanation in equation (5) is vague, with multiple notations left unexplained and not contextualized with respect to how it is semantic-aware.
- In the results on the public dataset (SKI10), there seems to be no improvement in bone segmentation over the iterations, and even a decrease in cartilage segmentation. Only when SA-MB is included is there an improvement in cartilage segmentation, but still no clear improvement in bone segmentation. This raises some concerns about the convergence of the iterative process and its stability. Although this is not exactly the case for the in-house dataset, the authors should provide more arguments and discussion in this respect.
Minor remarks:
- The segmentation labels are not clear: in different parts of the manuscript it is mentioned that the work aims to segment “e.g. bones and cartilage” or “Assume that I_{seg,l} is … l \in {bone, cart}”. Does the segmentation have only these 2 labels? If so, this should be stated affirmatively and more clearly within the text (introduction, experimental setup), because the work is presented as a rather general segmentation tool but is actually only tested on this segmentation task.
- Improve the training details, e.g., the cross-entropy and Dice losses are combined, but how?
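The page does not say how the two losses are weighted, but a common convention is a simple weighted sum, L = L_CE + lambda * L_Dice. Below is a minimal PyTorch sketch of that convention; the soft-Dice form and the dice_weight default are assumptions, not necessarily what the authors did.

import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target_onehot, eps=1e-6):
    # Soft Dice averaged over classes; `target_onehot` has the same shape as `logits`.
    probs = torch.softmax(logits, dim=1)
    dims = (0, 2, 3)                                   # sum over batch and spatial dimensions
    inter = (probs * target_onehot).sum(dims)
    denom = probs.sum(dims) + target_onehot.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

def combined_loss(logits, target_idx, target_onehot, dice_weight=1.0):
    # Weighted sum of cross-entropy and soft Dice; the weight is an assumption.
    return F.cross_entropy(logits, target_idx) + dice_weight * soft_dice_loss(logits, target_onehot)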
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Add to the future work that the proposed semi-supervised strategy can be extended to alternative medical image segmentation tasks.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The novelty of the proposed solution of using a foundation model compared to common prompting, as well as the computational framework, which can be translated to other medical imaging segmentation tasks where data is scarce and SSL is in demand.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank all reviewers for their valuable comments. We appreciate that they agree on the strengths of our work in terms of novelty (R1, R2, R3), the improvements on specific tasks (R1, R3), the significance of removing human interaction (R1), writing coherence (R2), and generalizability (R3). We will diligently revise and expand our current work based on the valuable feedback from all reviewers.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A