Abstract
Despite the success of deep learning in automatic medical image segmentation, it heavily relies on manual annotations for training that are time-consuming to obtain. Unsupervised segmentation approaches have shown potential to eliminate manual annotations, but they often struggle to capture distinctive features for low-contrast and inhomogeneous regions, limiting their performance. To address this, we propose UM-SAM, a novel unsupervised medical image segmentation framework that harnesses the capabilities of the Segment Anything Model (SAM) for pseudo-label generation and segmentation network training. Specifically, class-agnostic pseudo-labels are generated via SAM’s everything mode, followed by a shape prior-based filtering strategy to select valid pseudo-labels. Given SAM’s lack of class information, a shape-agnostic clustering technique based on ROI pooling is proposed to identify target-relevant pseudo-labels according to their proximity in feature space. To reduce the impact of noise in pseudo-labels, a triple Knowledge Distillation (KD) strategy is proposed to transfer knowledge from SAM to a lightweight task-specific segmentation model, comprising pseudo-label KD, class-level feature KD, and class-level contrastive KD. Extensive experiments on fetal brain and prostate segmentation tasks demonstrate that UM-SAM significantly outperforms existing unsupervised and prompt-based methods, achieving state-of-the-art performance without requiring manual annotations.
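As a rough illustration of the shape prior-based filtering described above, the sketch below (not from the paper) assumes SAM's everything mode has already produced a set of binary masks, and discards candidates whose foreground size or bounding-box aspect ratio falls outside prior bounds; the parameter names Vmin/Vmax and Amin/Amax follow the reviews, while the exact criteria are our assumptions.

```python
import numpy as np

def filter_masks_by_shape_prior(masks, v_min, v_max, a_min, a_max):
    """Keep SAM-generated binary masks whose foreground size and bounding-box
    aspect ratio fall within prior bounds (illustrative criteria only)."""
    kept = []
    for m in masks:  # each m: 2D boolean numpy array from SAM's everything mode
        area = int(m.sum())
        if not (v_min <= area <= v_max):          # size prior (Vmin/Vmax)
            continue
        ys, xs = np.nonzero(m)
        h = ys.max() - ys.min() + 1
        w = xs.max() - xs.min() + 1
        aspect = max(h, w) / max(min(h, w), 1)    # aspect-ratio prior (Amin/Amax)
        if a_min <= aspect <= a_max:
            kept.append(m)
    return kept
```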
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2296_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{FuJia_UMSAM_MICCAI2025,
author = { Fu, Jia and Li, He and Lu, Tao and Zhang, Shaoting and Wang, Guotai},
title = { { UM-SAM: Unsupervised Medical Image Segmentation using Knowledge Distillation from Segment Anything Model } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
pages = {621--631}
}
Reviews
Review #1
- Please describe the contribution of the paper
The work proposes a new method, UM-SAM, for unsupervised medical image segmentation. UM-SAM leverages the Segment Anything Model (SAM) with shape-based filtering and ROI feature clustering methods to generate pseudo-labels (masks) for a specific task dataset. These masks are then used to train a lightweight student model via knowledge distillation from SAM using a combination of several losses: cross-entropy, Dice, class-level feature, and class-level contrastive losses. The work evaluates the proposed method on 2 medical segmentation datasets: (i) Fetal Brain (FB), and (ii) the Promise12 dataset. The proposed method achieves good results in comparison to prompt-based and unsupervised baselines, but still lags behind fully-supervised results.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Novelty: to the best of the reviewer’s knowledge, the proposed method is novel. The shape prior-based filtering is an interesting method to filter poor masks generated by SAM. Moreover, it is interesting (and quite surprising) to see how knowledge distillation can significantly boost the performance of segmentation results, even though the student model is trained using noisy pseudo-ground truths, and the teacher model (SAM) is itself inaccurate.
Impressive results on evaluation datasets: The results of the proposed method across the 2 datasets are clearly much better than the baselines, which is commendable. Nevertheless, as stated in the weaknesses, the two datasets used are small and highly specific, and it is unclear whether the proposed method will generalize to more diverse datasets and tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Limited evaluation (datasets): a major weakness of this paper is the lack of evaluation on a diverse set of medical segmentation tasks, even though the proposed method is general. Moreover, the 2 chosen datasets only contain a single class mask label (brain for FB and prostate for Promise12). This makes it extremely difficult to interpret the generalizability of the proposed method, especially in multi-class settings. I suggest the authors consider evaluating their method on comprehensive medical segmentation benchmarks such as the Medical Segmentation Decathlon [1] or MedSegBench [2].
Limited evaluation (baselines): While the work acknowledges and compares against some “prompt-based” baselines, such as SaLIP and MedCLIP-SAM, there are several other significant works and baselines missing. In my opinion, one of the biggest baselines missing is MedSAM [3], especially as the work leverages the original non-medical SAM. Other minor baselines include works already cited by the authors (DeepCluster, level set, DeepCut), as well as other works not cited such as CUTS [4].
Selection of hyperparameters requires a validation set: while the proposed method is claimed to be unsupervised, there are several hyperparameters that need to be chosen specifically for each dataset, which ultimately requires a validation set. First, the shape filtering parameters Vmin, Vmax, Amin, Amax. Second, the knowledge distillation hyperparameters lambda1, lambda2, and training hyperparameters (learning rate, momentum, etc.). Consequently, it is unclear whether the proposed method truly is “unsupervised”.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
It is unintuitive to me why knowledge distillation significantly boosts performance (as shown in the ablation study Fig. 3b). Can the authors please comment on how and why knowledge distillation enables the student model to learn the segmentation task given the noisy pseudo-label ground truths?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the proposed method in the paper is novel, and the results demonstrated on the small and quite niche medical datasets are promising, the work is severely lacking in demonstrating its ability to generalize to other medical segmentation tasks, therefore limiting its significance and impact. Moreover, it is unclear whether the proposed method is truly unsupervised given the tuning of hyperparameters on validation set.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I thank the authors for their response. I will stick with a borderline reject for this paper due to the remaining vagueness about the hyperparameters in the framework. This, combined with the evaluation on only one dataset, casts some doubt on the robustness of the method.
Review #2
- Please describe the contribution of the paper
This work proposes an unsupervised segmentation approach that leverages the Segment Anything model (SAM) for pseudo-label generation and applies knowledge distillation to train a lightweight segmentation model.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Strengths:
- The motivation behind the algorithm is clearly described, and the overall writing is easy to follow.
- The performance improvement over existing unsupervised methods is significant and, notably, even outperforms prompt-based segmentations in some cases.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Weaknesses and Questions:
- I have some concerns regarding the problem formulation, particularly the hyperparameter tuning involved. For instance, how are V_min, V_max, A_min, and A_max determined for each dataset? It seems these values are tuned based on the mask size distribution or task type — e.g., using the range of mask_size from the training set — which implies prior access to ground truth masks. This assumption contradicts the core idea of unsupervised learning. If the ground truth masks are not used, then how do the authors determine the expected size range of target objects for each dataset or task? If these hyperparameters are tuned using a small set of annotated samples, then this setting would be closer to weakly- or semi-supervised learning rather than fully unsupervised.
- Similarly, how are the hyperparameters for SAM in “segment everything” mode chosen? As shown in Fig. 10 of [1], this mode can be adapted via the grid point density, producing masks from larger to smaller regions. How do the authors ensure that the generated mask corresponds exactly to the target object rather than fragmented pieces that must be combined? Moreover, [1] notes that the segment-everything mode tends to work better on clear-boundary, round-shaped objects — which is often not the case in medical imaging. This raises concerns about the generalizability and effective scope of the proposed approach.
- I was also a bit surprised by the results showing that the unsupervised model even outperforms SAM with box prompts. For instance, in Fig. 2, SAM with a box fails to capture the target boundary, while the proposed method succeeds. Could the authors provide more details about their implementation setup and also elaborate on why the proposed method might outperform SAM with strong priors like box prompts?
[1] Maciej A. Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz, Yixin Zhang, Segment anything model for medical image analysis: An experimental study, Medical Image Analysis, 2023
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
My questions are about the hyperparameter selection strategy, which is my major concern.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
- I still feel this method would easily fail on similarly shaped or sized objects, especially in 3D where structures start and end gradually across slices. Without strong priors, I don’t think it can reliably separate such regions — it likely crashes when similar-looking objects appear.
- I’m honestly surprised the authors say it’s common to tune on the test set. Even if using a validation set to select a model/parameter is somewhat acceptable, relying on labeled data for hyperparameter tuning and model training makes it feel not truly unsupervised. If you’re already assuming labeled cases exist at training time, why not just use them for training and compare with semi-supervised methods?
Review #3
- Please describe the contribution of the paper
The paper introduces UM-SAM that uses SAM for pseudo-label generation and a triple knowledge distillation strategy to train a lightweight segmentation model, achieving state-of-the-art results on fetal brain and prostate segmentation without manual annotations.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The motivation of the paper is solid. Reducing the annotation burden is a critical issue for medical image segmentation, and it makes sense to use SAM while incorporating a lightweight approach. Experiments on two public datasets demonstrate its practical impact, and ablation studies confirm the effectiveness of its components.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Using the SAM encoder as a feature extractor appears to be a more natural choice. Please clarify the motivation for choosing DINO over the SAM encoder.
- Please specify whether a 2D or 3D UNet was used during the segmentation training stage.
- To my understanding, SAM achieved relatively poor performance on prostate segmentation. Was any preprocessing applied to improve its performance in this study? If so, please detail the preprocessing steps.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
See above.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ response basically resolved my confusion.
Author Feedback
We sincerely thank the reviewers for their insightful and constructive comments, and for describing our work as “novel, interesting (and quite surprising), easy to follow” (R1), with a “clearly described” (R2) and “solid” (R3) motivation, addressing a “critical issue” (R3), and achieving “commendable” (R1) and “significantly improved” (R2) results. We address the main concerns below:
Annotated validation set (R1&2) In unsupervised learning, as no annotations are available for training, hyperparameter tuning remains a key challenge, as highlighted in STEGO (ICLR’22) and SmooSeg (NeurIPS’23). Following common practice in machine learning, existing methods typically use a labeled set for this purpose, e.g., DeepCluster (ECCV’18) uses a validation set, and CUTS (MICCAI’24) and DiffSeg (CVPR’24) even tune on the test set. To the best of our knowledge, no unsupervised image segmentation (UIS) method can automatically determine hyperparameters for arbitrary datasets without a labeled validation/test set. Our work used a validation set for hyperparameter selection in all methods for fairness. Importantly, ground truth was never used in training or pseudo-label generation. Notably, weakly- and semi-supervised methods use partial labels for training and a fully annotated validation set for hyperparameter tuning, without directly using the validation set for back-propagation. Overall, the term “unsupervised” describes the training set; thus, our experimental setting, which follows existing works, is reasonable.
Evaluation datasets (R1&2) Our current work focuses on binary segmentation, aligned with recent UIS studies (e.g., SaLIP, MedCLIP-SAM), due to the inherent difficulty of multi-class medical image segmentation without any supervision. We appreciate the suggestions regarding evaluation on MSD and MedSegBench, however, we haven’t seen existing UIS studies using them, and will consider them in the future.
Prior knowledge (R1&2)
- The prior knowledge (Vmin/Vmax and Amin/Amax) comes from anatomical knowledge, e.g., typical organ size and aspect ratio from clinicians, and is not derived from dataset-specific labels.
- For SAM’s everything mode, we evaluated multiple grid densities and selected values that produce segment sizes approximately matching the expected target scale, without using any ground truth for tuning.
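As an illustration of the grid-density selection described above, a minimal sketch using the official segment_anything package might look as follows; the candidate densities, the "vit_b" backbone, and the median-area scoring rule are assumptions for illustration rather than the authors' actual procedure.

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

def pick_grid_density(image, checkpoint, expected_area, densities=(8, 16, 32, 64)):
    """Run SAM's everything mode at several point-grid densities and keep the
    density whose median mask area is closest to the expected target scale.
    'expected_area' would come from anatomical knowledge, not from labels."""
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)  # backbone choice is assumed
    best_density, best_gap = None, float("inf")
    for pps in densities:
        generator = SamAutomaticMaskGenerator(sam, points_per_side=pps)
        masks = generator.generate(image)  # list of dicts with 'segmentation', 'area', ...
        if not masks:
            continue
        median_area = float(np.median([m["area"] for m in masks]))
        gap = abs(median_area - expected_area)
        if gap < best_gap:
            best_density, best_gap = pps, gap
    return best_density
```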
Compared methods (R1)
- As MedSAM is trained with full annotations from Promise12, comparing it with our fully unsupervised method would be unfair. We will compare with MedSAM on non-overlapping datasets in future work.
- Due to space constraints, we prioritized comparing with more recent and highly relevant methods (e.g., DINO, SaLIP, MedCLIP-SAM) over older methods such as DeepCluster and level-set approaches. We appreciate the suggestions and will compare with more methods in future work.
SAM’s lower performance (R2) Due to the domain gap between natural and medical images, SAM sometimes fails on medical imaging tasks. In contrast, our method adapts better to the task using pseudo-labels and knowledge distillation (KD), which inherits SAM’s feature representation ability and is further enhanced by the task-specific feature distribution.
Why KD improves performance under noisy pseudo-labels (R1) Prior studies (e.g., Feature Normalized KD, ECCV’20) show that L2-norm-based KD with temperature scaling improves robustness to label noise. Besides pseudo-label KD, we introduced class-level feature KD and contrastive KD; the latter further encourages intra-class consistency and inter-class separability.
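To make the three distillation terms concrete, below is a minimal PyTorch sketch of how pseudo-label KD, class-level feature KD, and class-level contrastive KD could be combined; the prototype computation, the specific loss forms, and the weights are our assumptions rather than the paper's exact formulation (the pseudo-label term in the paper also includes a Dice loss, omitted here for brevity).

```python
import torch
import torch.nn.functional as F

def class_prototypes(features, pseudo_labels, num_classes):
    """Average features over each pseudo-class region: (B,C,H,W) -> (K,C).
    Assumes features are already at label resolution and in a shared space."""
    protos = []
    for k in range(num_classes):
        region = (pseudo_labels == k).unsqueeze(1).float()      # (B,1,H,W)
        denom = region.sum().clamp(min=1.0)
        protos.append((features * region).sum(dim=(0, 2, 3)) / denom)
    return torch.stack(protos)                                   # (K,C)

def triple_kd_loss(student_logits, student_feat, teacher_feat,
                   pseudo_labels, num_classes, lam1=0.5, lam2=0.5, tau=0.1):
    # 1) Pseudo-label KD: supervise the student with SAM-derived pseudo-labels
    #    (the paper also uses a Dice term, omitted here).
    l_pl = F.cross_entropy(student_logits, pseudo_labels)

    # 2) Class-level feature KD: align student and teacher class prototypes.
    p_s = class_prototypes(student_feat, pseudo_labels, num_classes)
    p_t = class_prototypes(teacher_feat, pseudo_labels, num_classes)
    l_feat = F.mse_loss(p_s, p_t)

    # 3) Class-level contrastive KD: pull matching class prototypes together and
    #    push different classes apart (InfoNCE over cosine similarities).
    sim = F.cosine_similarity(p_s.unsqueeze(1), p_t.unsqueeze(0), dim=-1) / tau
    l_con = F.cross_entropy(sim, torch.arange(num_classes, device=sim.device))

    return l_pl + lam1 * l_feat + lam2 * l_con
```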
Motivation for using DINO (R3) As reported in CryoSAM (MICCAI’24), DINO provides richer semantic features than SAM’s image encoder, so we adopt DINO to extract ROI features for pseudo-label selection.
Implementation details (R3) We used a 2D UNet as the segmentation backbone. For Promise12, we applied intensity normalization by clipping values to the [0, 99] percentile range during preprocessing. Such details can be easily clarified in the paper.
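The intensity normalization mentioned for Promise12 (clipping to the [0, 99] percentile range) can be written in a few lines of numpy; the subsequent rescaling to [0, 1] is an assumption on our part.

```python
import numpy as np

def clip_to_percentile_range(volume, low=0.0, high=99.0):
    """Clip intensities to the [low, high] percentile range, then rescale to [0, 1]."""
    lo, hi = np.percentile(volume, [low, high])
    clipped = np.clip(volume, lo, hi)
    return (clipped - lo) / max(hi - lo, 1e-8)
```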
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
Although the reviewers recognize that this work has some merits, they have identified important concerns, mostly related to hyperparameter selection (how the hyperparameters were selected, on which dataset, etc.) and generalizability to other medical problems/datasets, which strongly limits the practicability of this work given the arguably lower complexity of the task (i.e., a binary segmentation problem). I side with the reviewers in that the classification of the method should be revisited if the authors tune the hyperparameters on an external/independent validation set. Furthermore, the reviewers have stressed the lack of comparison to more relevant baselines as a major concern in the empirical validation. I strongly recommend that the authors clarify these and other points raised by the reviewers in a rebuttal.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper received mixed scores after the rebuttal. The main weaknesses are the unclear hyperparameter settings, limited evaluation on a single dataset, and concerns about the method’s reliability on complex cases. Upon reviewing the manuscript and the rebuttal, I agree that it is not ready for publication in its current form.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Paper 2296 introduces UM-SAM, a pipeline that lets “segment-everything” SAM generate pseudo-masks, prunes them with simple shape thresholds, and then distils the filtered masks into a lightweight U-Net with pixel-, feature-, and contrastive losses. On two single-class datasets—fetal-brain MRI and PROMISE12 prostate MRI—the student network markedly outperforms earlier unsupervised or prompt-based baselines, though it still trails full supervision.
Reviewers agree the idea is conceptually appealing, yet two of the three ultimately recommend rejection and their main objections remain unresolved after rebuttal. First, the evidence base is narrow: only two tasks are reported, both binary, and several pertinent baselines (e.g. MedSAM, CUTS, DiffSeg) are absent; the authors acknowledge this and defer broader testing to future work. Second, the method’s “fully unsupervised” label is weakened by the need to tune multiple shape thresholds, loss weights, and grid densities on a labelled validation set—an experimental detail not disclosed in the manuscript. Third, generalisability is uncertain: the shape filter assumes compact, roughly ellipsoidal targets, and no results are shown for multi-class or anatomically complex cases. The rebuttal clarifies hardware, tuning practice, and the intuition behind knowledge distillation, but — per conference rules — cannot add fresh experiments; consequently the key reservations above persist.
Balancing one Accept against two confident Rejects and considering the limited scope and undisclosed reliance on validation labels, the AC recommends Reject.