Abstract

In medical image segmentation, manual annotation is exceptionally costly, which highlights the critical need to select the most valuable samples for labeling. Active learning provides an effective solution for selecting informative samples; however, it faces the cold start challenge, where the initial training samples are randomly chosen, potentially leading to suboptimal model performance. In this study, we present a novel cold start active learning framework based on the Segment Anything Model (SAM), which leverages SAM's zero-shot capabilities on downstream datasets to address the cold start issue effectively. Concretely, we employ a multiple-augmentation strategy to estimate an uncertainty map for each case and then compute patch-level uncertainty aligned with the patch-level features produced by SAM's image encoder. We then propose a Patch-based Global Distinct Representation (PGDR) strategy that integrates patch-level uncertainty and image features into a unified image-level representation. To select samples that are both representative and diverse, we propose a Greedy Selection with Cluster and Uncertainty (GSCU) strategy, which effectively combines image-level features and uncertainty to prioritize samples for manual annotation. Experiments on prostate and left atrium segmentation datasets demonstrate that our framework outperforms five state-of-the-art methods as well as random selection at various selection ratios. On the two datasets, our method achieves performance comparable to fully supervised training with annotation burdens of only 10% and 1.5%, respectively. Code is available at https://github.com/Hilab-git/SUGFW.git
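
For concreteness, the minimal sketch below illustrates the uncertainty-estimation and feature-weighting steps described above. It assumes SAM's ViT-B encoder output (256x64x64 for a 1024x1024 input), an entropy-based uncertainty map, and placeholder callables sam_predict, sam_encode and an augmentation list; it is an illustration of the idea rather than the released implementation, whose exact weighting may differ.

    import numpy as np

    def patch_uncertainty_and_feature(image, sam_predict, sam_encode, augmentations, patch=16):
        # Average K augmented SAM predictions (foreground probabilities mapped
        # back to the original image space), then compute a pixel-wise entropy map.
        probs = [inv(sam_predict(aug(image))) for aug, inv in augmentations]
        p = np.mean(probs, axis=0)
        eps = 1e-6
        entropy = -(p * np.log(p + eps) + (1.0 - p) * np.log(1.0 - p + eps))

        # Pool the entropy map to one value per encoder patch, e.g. a 64x64 grid
        # for a 1024x1024 input with 16x16 patches.
        H, W = entropy.shape
        patch_unc = entropy.reshape(H // patch, patch, W // patch, patch).mean(axis=(1, 3))

        # Uncertainty-weighted pooling of the patch features into a single
        # image-level representation (a stand-in for PGDR, not the paper's exact formula).
        feats = sam_encode(image)                            # assumed shape (256, 64, 64)
        w = patch_unc.reshape(-1)
        w = w / (w.sum() + eps)
        image_feat = feats.reshape(feats.shape[0], -1) @ w   # shape (256,)
        return patch_unc, image_feat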

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1739_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/HiLab-git/SUGFW.git

Link to the Dataset(s)

Promise12 dataset: https://promise12.grand-challenge.org/Home/
UTAH dataset: https://www.cardiacatlas.org/atriaseg2018-challenge/atria-seg-data/

BibTex

@InProceedings{MaXia_SUGFW_MICCAI2025,
        author = { Ma, Xiaochuan and Fu, Jia and Zhong, Lanfeng and Zhu, Ning and Wang, Guotai},
        title = { { SUGFW: A SAM-based Uncertainty-guided Feature Weighting Framework for Cold Start Active Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        page = {579 -- 588}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the problem of cold start in active learning for medical image segmentation. The authors propose a novel framework that leverages the zero-shot capabilities of the Segment Anything Model (SAM) to improve the selection of informative samples for annotation. Specifically, SAM is used to generate pseudo-masks, estimate uncertainty maps, and weight image features based on this uncertainty. The experimental results on two medical image segmentation datasets demonstrate the effectiveness of the proposed framework in comparison to several baseline methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed framework is technically sound and well-motivated.

    2. The use of SAM to address the cold start problem is novel.

    3. The ablation study provides insight into the contribution of different components of the framework.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The greedy selection algorithm is not new and prior work is not appropriately acknowledged.

    2. The main experimental result is problematic.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Page 2, 1st paragraph, “As for uncertainty estimation, the initial performance of the model during the cold start phase is not satisfactory, leading to inaccurate results.” Why? Need more explanation and evidence.

    2. Computational cost of using bootstrapping with SAM.

    3. Page 4, stability score threshold 0.95. Ablation study is needed for this empirical value.

    4. Sect. 2.3, the method is exactly the same as what was proposed in this paper:

    • Zheng H, Yang L, Chen J, Han J, Zhang Y, Liang P, Zhao Z, Wang C, Chen DZ. Biomedical image segmentation via representative annotation. In Proceedings of the AAAI Conference on Artificial Intelligence 2019 (Vol. 33, No. 01, pp. 5901-5908).
    5. Table 1, why is only 5~15% reported? Is the performance saturated after 15%?

    6. Fig. 6, the variation is too large and does not make sense. Why does adding more training data lead to much worse results for some baselines?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Selection algorithm is not new, and prior work was not well acknowledged.

    2. The computational cost of using SAM to generate uncertainty.

    3. Main experimental results are problematic.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes an active learning framework based on the features and predictions of the Segment Anything Model to address the cold start issue.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses a well-known problem in Active learning: cold start

    • The AL approach relies on knowledge from a foundation model to compute uncertainty rather than on the trained model that can be unstable or mis-calibrated.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • “we also applied E_img to X_i, thereby obtaining the patch-level image features F_i.” It is unclear to me what the patch-level features F_i are. SAM’s image encoder generates features of shape 256x64x64 for the entire image. Hence, what do the patches correspond to? And what is their size?

    • Can you explain eq. 3 in more detail?

    • Although the number of clusters should have an important impact on the AL selection, the method to select the parameter is not stated.

    • While comparisons with recent baselines are provided in Table I, and random is a strong baseline, it would be good to also compare against uncertainty-based or hybrid AL methods (given random initialization). For example: “Dropout as a Bayesian approximation: representing model uncertainty in deep learning.” Gal et al. ICML (2016); “Learning loss for active learning.” Yoo et al. CVPR (2019)

    • It is not clear if the main results, other than random sampling, are also averaged over several runs with different initialization seeds.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Beware of typos: “shpae”, Eq. 7 has a missing parenthesis, etc.
    • The phrase “minimum distance to the uncertainties of all currently selected samples” is unclear. An uncertainty is a value, not a position.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Clarifications are required to make the approach more understandable. Several typos too.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors introduce SUGFW, a novel active learning framework designed to address the cold start problem in medical image segmentation. (1) It leverages the zero-shot capabilities of SAM and data augmentations to generate epistemic uncertainty estimates and feature representations for unlabeled samples, without requiring initial manual annotations. (2) It further employs a patch-level fusion of image features and corresponding uncertainty. (3) After initialization, a greedy clustering-based selection strategy iteratively chooses the most valuable samples for annotation, which are then added to the training set.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Methodological Novelty: The combination of SAM and k-times augmentation to bootstrap uncertainty estimation and feature extraction is novel in the context of initializing active learning. The proposed sampling strategy extends the standard uncertainty + representativeness + diversity paradigm, where: 1) uncertainty is quantified at the image level, 2) representativeness is derived through uncertainty-guided feature weighting, and 3) diversity is achieved via clustering.

    2. Interpretability: The figures are clear and effectively illustrate both the workflow and the experimental results, enhancing the interpretability of the method.

    3. Validity: The method is validated on two public MRI datasets (prostate and left atrium), and results are compared against multiple state-of-the-art active learning baselines, demonstrating strong performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited Dataset Diversity: Evaluation is restricted to two 2D MRI segmentation tasks. This limits the generalizability of the method to other modalities (e.g., CT) or 3D segmentation tasks.

    2. Limited Experimental Validity: Section 3.1 does not mention the use of cross-validation or repeated runs with different random splits, even for the relatively small Promise12 dataset (778 slices). This raises concerns about the robustness of the reported results.

    3. Figure Clarity: Figure 3 shows visual comparisons of different active learning methods, but the caption and text do not specify which iteration or sampling ratio the visualizations correspond to. Including this information would improve clarity and help prevent misinterpretation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Spelling Issues. In Section 2.1, page 4, “shpae” -> “shape”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of integrating SAM with approximated epistemic uncertainty, the applicability of the proposed sampling strategy, and the clear structure and figures contributed to the overall score.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ reply clarifies most of my comments. The experimental setup replicates that of previous SOTA methods on two MRI datasets, validating the proposed method. However, the authors still haven’t clarified the active learning iteration (i.e., the sampling ratio or amount of labeled data used) for the visualizations in Fig. 3. This lack of transparency regarding the visualization settings, though likely unintentional, raises a concern about potential cherry-picking.




Author Feedback

We sincerely thank all reviewers for their positive/constructive comments, where they described our framework as “novel” (R1&R2&R3), “stable” (R2), “interpretable” (R3) and “technically sound” (R1). The primary concerns are summarized and addressed below.

  1. Differences between our “GSCU” and “representative annotation (RA)” [AAAI-19] (R1): Although both works use a greedy algorithm, the differences lie in (1) the per-step strategy, where we select the sample whose uncertainty is farthest from that of the currently selected set, while RA [AAAI-19] chooses the sample that maximizes a coverage score, and (2) the optimization objective, where we aim for a uniform distribution of uncertainty, while RA still maximizes the coverage score (an illustrative sketch of our per-step rule is given after this feedback). Note that GSCU is only the final stage of our framework; the main contributions lie in the SAM-based patch-level uncertainty generation and PGDR. We will briefly clarify these differences in our manuscript.

  2. Computation cost of using SAM to generate uncertainty (R1): The computational cost of bootstrapping mainly lies in the uncertainty calculation, which is only required for sample selection and is not needed at inference time. With SAM's ViT-B backbone on a V100 GPU, this calculation costs 10.4 TFLOPs and about 8 seconds per image; the total time is 1.73 hours for the Promise12 dataset and 13.15 hours for the UTAH dataset, which is affordable in most situations.

  3. Large variations in Fig. 2 (R1): Most of the methods exhibiting abnormal fluctuations (e.g., ProbCover, ALPS, CALR) were designed for general computer vision, natural images, or medical language, and are not entirely suitable for our medical images, which have high variability, low contrast, class imbalance and domain-specific semantics. Prior MICCAI work [12] also included these methods in its comparison and reported large variations similar to those observed in our study.

  4. Results with > 15% labeled data (R1): Yes, the performance saturates when more than 15% of the data are labeled. The Dice was 86.39%, 86.63%, 86.45%, 86.59%, 86.77% and 86.47% at annotation ratios of 15%, 20%, 30%, 70%, 80% and 90%, respectively.

  5. Ablation study of the stability score threshold (R1): We performed a grid search for this hyperparameter. When it was set to 0.85, 0.90, 0.95 and 0.97, the Dice was 81.26%, 84.91%, 86.63% and 86.42%, respectively, showing that 0.95 was the best value.

  6. Explanation for page 2, 1st paragraph (R1): The uncertainty produced by a randomly initialized model is not meaningful, so uncertainty-based methods struggle to address the cold start issue without a pretrained model. The corresponding sentence can be easily revised for clarification.

  7. Explanation of method (R2): (1) Patch and size: In SAM's ViT-B backbone, the patch size is 16x16, yielding 64x64 patches for an input image resized to 1024x1024, and each patch has a feature dimension of 256. (2) Eq. 3: We first average the k segmentation results and then compute the uncertainty based on the averaged prediction. (3) Number of clusters: The cluster number equals the desired number of samples to select, i.e., if we choose n samples, the number of clusters is set to n. (4) We will release our code to help readers better understand the implementation details.

  8. Datasets and compared methods (R2, R3): We used two public datasets (Promise12: MICCAI 2012, UTAH: MICCAI 2018) for experiments, demonstrating the effectiveness of our method; more datasets will be considered in a journal extension. For each dataset, we followed the official data split. For compared methods, due to the space limit, we prioritized more recent methods over older ones (ICML 2016 and CVPR 2019, as mentioned by R2); we will include them for a more comprehensive comparison in future work.
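
As an illustration of the per-step rule in point 1 (and of the cluster count noted in point 7), a minimal sketch is given below. The seeding choice and tie-breaking are assumptions, and the clustering step of GSCU (one pick per k-means cluster) is omitted for brevity.

    import numpy as np

    def greedy_uncertainty_selection(uncertainties, n_select):
        # Repeatedly pick the unlabeled sample whose uncertainty value is
        # farthest from the uncertainties of all currently selected samples,
        # spreading the selection over the uncertainty range.
        # In the full GSCU strategy, the number of clusters on the image-level
        # features equals n_select (point 7); that step is not shown here.
        u = np.asarray(uncertainties, dtype=float)
        selected = [int(np.argmax(u))]      # seed with the most uncertain sample (assumption)
        while len(selected) < n_select:
            remaining = [i for i in range(len(u)) if i not in selected]
            dist = [min(abs(u[i] - u[j]) for j in selected) for i in remaining]
            selected.append(remaining[int(np.argmax(dist))])
        return selected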




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors addressed most of the concerns, and the reviewers agree on the soundness and novelty of the proposed solution.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


