Abstract

Recent “segment anything” efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2056_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ZhilingYan/Medical-SAM-Bench

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YanZhi_SAMed2_MICCAI2025,
        author = { Yan, Zhiling and Song, Sifan and Song, Dingjie and Li, Yiwei and Zhou, Rong and Sun, Weixiang and Chen, Zhennong and Kim, Sekeun and Ren, Hui and Liu, Tianming and Li, Quanzheng and Li, Xiang and He, Lifang and Sun, Lichao},
        title = { { SAMed-2: Selective Memory Enhanced Medical Segment Anything Model } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        pages = {541--550}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper enhances the SAM-2 architecture for medical image segmentation by introducing two novel modules. First, it incorporates a temporal adapter into the image encoder to capture volumetric/temporal features, which is crucial for volumetric and sequential medical data. Second, it proposes a confidence-driven memory mechanism that retrieves relevant high-confidence features during inference, thereby improving robustness and mitigating catastrophic forgetting.

    To support training and evaluation, the authors also present MedBank-100k, a large-scale dataset comprising 21 segmentation tasks across 7 imaging modalities.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    One of the key strengths of this paper is the introduction of a confidence-driven memory mechanism, which extends the original memory design in SAM-2. Unlike SAM-2, which primarily leverages temporal correlations across sequential frames, the proposed approach also considers feature similarity and prediction confidence when selecting memory entries. By filtering memory features based on confidence and retrieving them based on similarity, this design ensures that the memory attention operates on more reliable and relevant representations. This is effective for medical image segmentation, where modality differences and annotation noise are common challenges.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. Questionable effectiveness of the temporal adapter: The proposed temporal adapter aims to improve performance by using 3D convolution within the SAM-2 image encoder to capture spatial or temporal correlations. However, as far as we know, the SAM-2 image encoder is typically applied to single-frame inputs or individual slices of 3D data. Given that the input is still 2D (e.g., one slice at a time), it is unclear whether a 3D convolution can meaningfully capture any temporal or volumetric dependencies in this setting.

    2. Missing implementation details: The paper compares performance against multiple SAM-based models, which are originally designed for interactive segmentation. However, it does not specify what type of prompts (if any) were provided during inference. Moreover, the sentence “We compare our model against SAM [8], SAM-2 [16], MedSAM [13], MedSAM-2 [22], and U-Net [17], where U-Net is trained per task and others use official checkpoints” raises the concern that the SAM-based models may not have been fine-tuned on the same dataset, potentially making the comparison unfair or suboptimal.

    3. Limited scope of experimental design: Although the proposed model is built on SAM-2, most of the benchmark tasks remain conventional 2D segmentation tasks. The paper does not compare against strong 2D segmentation baselines such as nnU-Net, which is widely used in medical image segmentation. Furthermore, the improvements over U-Net on external datasets are marginal, especially considering that SAM-based models operate on high-resolution 1024×1024 inputs, significantly increasing computational cost. It is unclear whether the performance gain justifies this overhead.

    4. Ambiguity in the ablation study baseline: Since the proposed model builds upon SAM-2, it is assumed that the baseline (without added modules) should perform similarly to SAM-2. However, the results presented in the ablation study show the baseline performing significantly better than SAM-2 in the main comparison table. This discrepancy raises questions about whether there are additional influencing factors not clearly described in the paper.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1. I remain skeptical about whether the proposed temporal adapter effectively captures spatial or temporal information as intended, especially considering the input setting of the SAM-2 image encoder.

    2. The experimental comparisons lack strong 2D medical image segmentation baselines, and the implementation details are not sufficiently described to ensure reproducibility or fair evaluation.

    3. The ablation study lacks clarity regarding the baseline setup, specifically why the baseline results differ significantly from SAM-2 despite supposedly sharing the same core architecture.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contributions of SAMed-2 were extending the SAM-2 architecture with a temporal adapter to capture image correlations and a confidence-driven memory module to store high-certainty features for later retrieval. Additionally, the authors curated a large-scale, diverse dataset, MedBank-100k, covering 21 segmentation tasks and 7 modalities.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A major strength of the paper is the extensions to the architecture. The temporal adapter within the encoder can capture temporal or volumetric information in medical images, going beyond previous “segment anything” models like SAM or SAM-2 that lack temporal awareness. This allows the model to leverage spatial-temporal context. Secondly, the paper incorporates a confidence-driven memory module that retains high-certainty features during training and a confidence-similarity retrieval strategy during inference. This helps mitigate catastrophic forgetting across diverse segmentation tasks and ultimately improves noise robustness in large medical datasets. The paper curates the MedBank-100k dataset from 21 tasks and 7 modalities and shows consistent performance improvements across most tasks in both zero-shot and few-shot scenarios. Figure 4 also demonstrates that there is clinical utility to using SAMed-2, since it reduces annotation effort by 87.6% for a cardiovascular segmentation task without sacrificing accuracy. The authors show the model can provide practical value for clinical annotation.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    For the confidence-similarity memory retrieval formulation, it is unclear why they sum the cosine similarity and sigmoid confidence value and whether they are normalized and of similar scale. In the final text, it will be useful to see clarification about this choice along with an ablation study showing whether this summation is robust.

    Additionally, SAMed-2 shows limited gains or underperforms on several tasks, such as Liver Tumor, Adrenal Gland, or Polyp, compared to SAM-2 or a U-Net. Furthermore, despite SAMed-2 being a foundation model, the U-Net still performs comparably or better (e.g., Liver), warranting discussion. There is also little discussion about modality-specific failure cases, relevant for cases like echocardiography where the performance is much lower than its counterparts. It is also unclear whether the data preprocessing steps induced bias by favoring well-annotated, clear images when considering how SAMed-2 could be used with real-world clinical data.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper had thorough evaluation with extensive experiments across many tasks and modalities and was able to show improvement over SAM, MedSAM and U-Net for some tasks. This paper has compelling ideas and technical contributions like the temporal adapter and confidence-driven memory.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After reading the author response and looking at the other reviews, I will stick with an accept due to the methodological contributions and sufficient evaluation.



Review #3

  • Please describe the contribution of the paper

    The authors proposed SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. To train the model, the authors develop a large dataset consisting of 122,594 frame-mask pairs. SAMed-2 introduced a new memory pool based on feature-level similarity and confidence. The method was evaluated on both internal and external tasks, where SAMed-2 demonstrates superior performance over original SAM and SAM 2 as well as MedSAM and MedSAM 2.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper demonstrates considerable effort in both data collection and model training, which contributes to its strong performance. Experiments are conducted across a wide range of datasets, and the ablation study is thorough, providing clear insights into the contribution of each component. Additionally, the inclusion of few-shot learning scenarios and a human user study further supports the practical value and applicability of the proposed method.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The implementation details for prompt simulation are missing, both during training and inference. It is unclear how prompts are simulated.

    2. The definition of “external dataset” lacks clarity. Although the authors describe external datasets as ones “representing new patients, imaging conditions, and tasks,” it appears that some tasks may have been encountered during pre-training, such as liver CT segmentation, which overlaps with the BTCV dataset mentioned in the pre-training phase.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper demonstrates strong performance over popular baselines, including MedSAM and MedSAM 2, supported by thorough evaluations. My concerns are mainly clarification questions, which should be relatively straightforward to address.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed my concern.




Author Feedback

We sincerely thank the reviewers for their insightful comments. Below are our clarifications:

Reviewer 1:

  1. Memory retrieval formulation. Following your feedback, we clarify our design: we sum similarity and confidence to preserve their individual contributions. Cosine similarity is normalized to [0,1] using ReLU, which excludes negative correlations and matches the sigmoid confidence scale. To verify the robustness of this summation, an extra ablation demonstrates up to a 2.53% Dice improvement compared to using similarity alone. Comparing summation vs. multiplication also confirms that summation performs better (Dice = 0.6938 vs. 0.6816).

  2. Limited gains over SAM-2 or U-Net. U-Net results are after fine-tuning, while SAMed-2 results are zero-shot. Even without fine-tuning, SAMed-2 achieves higher average scores and outperforms U-Net on most tasks, including mentioned Liver (0.7738 vs. 0.7501), Adrenal Gland (0.2886 vs. 0.1869), and Polyp (0.8183 vs. 0.2391). After extra fine-tuning, SAMed-2 surpasses U-Net on Liver Tumor (0.4727 vs. 0.4528). SAMed-2 also outperforms SAM-2 on all tasks under the zero-shot setting. We confirm no selective preprocessing was used to favor well-annotated or clearer images.
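
The scoring rule described in point 1 above (ReLU-normalized cosine similarity plus sigmoid confidence, combined by summation) can be sketched as follows. The function and variable names are illustrative, not taken from the released code:

```python
import numpy as np

def retrieval_scores(query, memory_feats, memory_logits):
    """Score each memory entry as ReLU-clipped cosine similarity (in [0, 1])
    plus sigmoid confidence (in [0, 1]), as described in the rebuttal."""
    q = query / np.linalg.norm(query)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    cos = m @ q                                   # cosine similarity per entry
    sim = np.maximum(cos, 0.0)                    # ReLU: drop negative correlations
    conf = 1.0 / (1.0 + np.exp(-memory_logits))   # sigmoid confidence
    return sim + conf                             # summation keeps both contributions
```

With a query identical to a stored feature and neutral (zero) logits, that entry scores 1.0 + 0.5 while an anti-correlated entry scores 0.0 + 0.5, so both terms stay on the same [0, 1] scale and neither dominates by construction.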

Reviewer 2:

  1. Prompt implementation. Following MedSAM, we used bounding box prompts simulated from expert annotations with random perturbation (0–20 pixels). This applies uniformly across all SAM-based models for fair comparison.

  2. Definition of “external datasets”. External datasets are those completely unseen in pre-training. Despite possible task overlap, the data are from different institutions or patient groups, with clear domain gaps.
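
The MedSAM-style box-prompt simulation mentioned above (a bounding box derived from the expert mask, perturbed by 0–20 pixels) might look like the following sketch; the exact perturbation scheme used in the paper may differ, and the jitter direction here is an assumption:

```python
import numpy as np

def simulate_box_prompt(mask, max_shift=20, seed=None):
    """Derive a bounding box (x0, y0, x1, y1) from a binary mask and jitter
    each edge outward by a random 0..max_shift pixels, clipped to the image."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    h, w = mask.shape
    dx0, dy0, dx1, dy1 = rng.integers(0, max_shift + 1, size=4)
    return (int(max(0, x0 - dx0)), int(max(0, y0 - dy0)),
            int(min(w - 1, x1 + dx1)), int(min(h - 1, y1 + dy1)))
```

With `max_shift=0` this reduces to the tight bounding box of the mask; applying the same simulation to every SAM-based model keeps the prompt condition uniform across comparisons.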

Reviewer 3:

  1. Effectiveness of temporal adapter. To clarify, each batch in SAMed-2 exclusively contains one type of data: shuffled 2D images, sequential video frames, or sequential slices from a 3D volume (no combination of different types within a batch). For 3D data, slices maintain sequential order along the batch dimension. Before applying the 3D convolution, we transpose this batch dimension into the convolution’s depth dimension. The 3D convolution kernel enables interaction between adjacent slices, thus capturing volumetric or temporal dependencies. After convolution, dimensions are restored. Table 4 in the manuscript confirms improvements (up to 10.34%) versus the baseline without the temporal adapter. To validate the effectiveness of the 3D convolution, we performed an additional ablation study by replacing the 3D convolution with a 2D convolution (no inter-slice interaction), observing a performance drop of up to 5.54%.

  2. Implementation details. As in MedSAM, we used bounding box prompts simulated from expert annotations with random perturbation (0–20 pixels). This applies to all SAM-based models for fair comparison. We use official checkpoints of SAM-based models to evaluate zero-shot capabilities, which is our primary focus. To address your concern, we fine-tuned MedSAM and MedSAM-2 with our pretraining data. Results (mean Dice on external tasks: SAMed-2 = 0.6938, MedSAM = 0.6616, MedSAM-2 = 0.5965) confirm SAMed-2’s superior generalization.

  3. Comparison with nnU-Net. Our primary goal is to develop a foundation model with strong zero-shot performance, not purely to surpass specialized baselines. Nonetheless, for a comprehensive evaluation, we additionally fine-tuned nnU-Net across the 21 internal tasks, yielding a lower mean Dice of 0.6488 versus SAMed-2’s 0.7118.

  4. Marginal external gains over U-Net. SAMed-2’s external results (Dice=0.6938) reflect zero-shot inference without fine-tuning, while U-Net’s results (Dice=0.6879) are after task-specific fine-tuning. For fairness, all models, including U-Net, used 1024×1024 inputs.

  5. Ablation baseline. Our ablation is to explore the impact of proposed modules (temporal adapter & memory). The ablation baseline is SAM-2 pretrained on medical data, different from original SAM-2. We confirm no additional differences except the pretraining data.
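
The batch-to-depth trick described in point 1 above (transposing sequential slices from the batch axis into the 3D convolution’s depth axis so adjacent slices interact) is, for a (K, 1, 1) kernel, equivalent to a 1D convolution along the slice axis applied per channel and pixel. A minimal NumPy sketch with illustrative names, not the paper’s implementation:

```python
import numpy as np

def temporal_adapter(x, depth_kernel):
    """x: (B, C, H, W) array of sequential slices; depth_kernel: length-K weights.
    Mixes each slice with its neighbors along the batch (depth) axis, which is
    what a (K, 1, 1) 3D convolution does after the batch-to-depth transpose."""
    b = x.shape[0]
    k = len(depth_kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))  # zero-pad slice axis
    out = np.zeros(x.shape, dtype=float)
    for i in range(b):
        for j in range(k):
            out[i] += depth_kernel[j] * xp[i + j]          # inter-slice interaction
    return out
```

An identity kernel [0, 1, 0] returns the input unchanged (no inter-slice mixing, analogous to the 2D-convolution ablation), while e.g. [0.25, 0.5, 0.25] blends each slice with its two neighbors.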




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I recommend acceptance of this paper.

    The paper presents a well-motivated and technically interesting framework for medical image segmentation, incorporating novel components such as a temporal adapter and a confidence-driven memory mechanism. These innovations are evaluated across multiple tasks and modalities, demonstrating strong performance on standard benchmarks. The experimental breadth and consistent improvements over baselines such as SAM, MedSAM, MedSAM-2, nnU-Net, and U-Net highlight the potential impact and relevance of this work.

    That said, some concerns remain. Specifically, questions about whether the proposed temporal adapter truly captures spatiotemporal features as intended are not fully resolved, especially considering the nature of the SAM-2 image encoder. Additionally, the lack of strong 2D segmentation baselines, limited implementation details, and unclear baseline ablation design (particularly the performance gap with SAM-2 despite shared components) affect the transparency and reproducibility of the findings.

    Nevertheless, the compelling ideas, thorough evaluation, and potential for impact outweigh these concerns. With improved clarity and additional experiments in future work, this study can further benefit the community. I recommend acceptance.


