Abstract
The Segment Anything Model 2 (SAM-2) has shown impressive capabilities for promptable segmentation in images and videos. However, SAM-2 primarily operates on visual prompts such as points, boxes, and masks, and does not natively support text prompts. This limitation is particularly noticeable in medical imaging, where domain-specific textual descriptions are often beneficial for annotating subtle abnormalities and identifying regions of interest. In this paper, we introduce Text-Guided SAM-2 (TGSAM-2), a medical image segmentation model tailored to leverage text prompts as contextual guidance. We propose a text-conditioned visual perception module that conditions visual features on textual descriptions, and we refine the memory encoder to track target objects using medical text prompts. We evaluate our method on four medical image datasets with video-like characteristics, including 2D image sequences (e.g., Endoscopy, Ultrasound) and 3D volumes (e.g., CT, MRI). Experimental results demonstrate that our method outperforms state-of-the-art models, including both image-only and text-guided medical image segmentation methods.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0846_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{YuaRun_TGSAM2_MICCAI2025,
author = { Yuan, Runtian and Zhou, Ling and Xu, Jilan and Li, Qingqiu and Chen, Mohan and Zhang, Yuejie and Feng, Rui and Zhang, Tao and Gao, Shang},
title = { { TGSAM-2: Text-Guided Medical Image Segmentation using Segment Anything Model 2 } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15969},
month = {September},
pages = {564--573}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a method for medical image segmentation that leverages text prompts as contextual guidance within the Segment Anything Model 2 (SAM-2) framework. The authors introduce two main innovations: (1) a text-conditioned visual perception (TCVP) module that conditions visual features using textual descriptions, and (2) a text-tracking memory encoder (TTME) designed to ensure target consistency across sequential frames. The method is evaluated across multiple modalities including ultrasound, endoscopy, CT, and MRI, and demonstrates superior performance compared to existing task-specific, text-guided, and point-prompted state-of-the-art approaches.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Insightful application domain alignment - The authors astutely identify that ultrasound, endoscopy, and sequential CT/MRI slices share video-like temporal properties, aligning well with the strengths of the SAM-2 model originally trained on such data.
Novel formulation - The work cleverly integrates textual prompts into multiple components of SAM2—including the encoder, decoder, and memory module—allowing language to guide visual features throughout the segmentation pipeline. This tight integration of language and vision across stages introduces a novel mechanism for improving segmentation consistency and accuracy, particularly in video-like medical data.
Performance - Empirical results on multiple modalities indicate strong segmentation performance, improving over several baselines, which suggests that textual prompts can enhance semantic understanding and localisation.
Potential impact - The method has the potential to influence how contextual information (via language) is utilised in enhancing medical image analysis, especially in video-like sequences.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Clarity of architectural presentation - There is a mismatch between the description of the architecture in the text and the notations used in Figure 2. For example, the notation ‘T’ (used to denote video frames) and ‘P’ (text prompts) are not consistently aligned with the annotations in the figure, leading to confusion about the data flow. The figure should also annotate the Tproj component and indicate the location of the learnable projection layer. Furthermore, the formula describing how text embeddings and positional encodings are combined should be explicitly given. In Figure 3(b), it is not clear what happens in the top part of the figure: after M_s is obtained, it is connected to a further Conv layer to produce m_features, and a positional encoding is used to generate m_pos. To which modules are m_features and m_pos then connected?
Inconsistent notation - Figure 3 introduces variable names such as m_features and m_pos, diverging from the mathematical notation (e.g., M_s) used elsewhere in the paper. Consistency in notation throughout the paper would greatly enhance readability.
Experimental evaluation - The use of evaluation metrics is not fully justified. The authors should explain the omission of standard metrics such as the Hausdorff Distance, which are commonly used in segmentation tasks. The Dice and IoU metrics are central to the evaluation but are not formally defined in the text. The authors should clarify how temporal consistency or tracking ability is evaluated, and whether these 2D metrics sufficiently capture performance in a video-context setting.
Implementation details - Parameter settings and design choices (e.g., number of memory slots, dimensions of projections, etc.) are presented without sufficient justification. These could influence reproducibility and fairness of comparison.
Ablation study limitations - The paper would benefit from a clear baseline description that excludes both the TCVP and TTME modules in order to isolate their individual and joint contributions more effectively.
Prompt definition ambiguity - The paper should clarify what is meant by “without text prompts.” Does this refer to an absence of textual input, or merely to prompts lacking specific attributes (e.g., shape or position)?
Scope of generalisation and usage - It is unclear whether the model supports alternative types of prompts (e.g., image-based, bounding boxes), or if there are constraints on the length or structure of text prompts.
Computational efficiency - No analysis is provided regarding the computational cost or inference time of the model, which is relevant for clinical applicability.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a compelling and timely approach to enhancing medical image segmentation using text-guided prompts within a SAM-2-based architecture. The integration of temporal memory and language-driven context is both novel and promising, particularly given the strong empirical results across diverse modalities. However, there are significant issues with clarity in architectural description, notation consistency, and experimental justification that limit the interpretability and reproducibility of the work in its current form. These weaknesses prevent a stronger recommendation, but the paper’s contributions are nonetheless valuable to the community and merit consideration if these concerns can be addressed through revision.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
While I initially raised concerns regarding architectural clarity, notation inconsistencies, and experimental completeness, the rebuttal addresses these points satisfactorily. The authors clarified the architectural flow, acknowledged missing definitions and metrics, and provided reasonable justifications for parameter choices and ablation structure. Although a more detailed efficiency analysis would strengthen the work, their commitment to include this in the final version is appreciated.
Review #2
- Please describe the contribution of the paper
The paper proposes a text-guided medical image segmentation framework based on SAM-2. It introduces a method for incorporating textual features into SAM’s prompt encoder and further integrates text features into the image encoder and memory bank of SAM-2 through two modules: Text-Conditioned Visual Perception (TCVP) and Text-Tracking Memory Encoder (TTME). The authors demonstrate the effectiveness and reliability of the proposed method through experiments conducted on datasets across multiple imaging modalities.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. While incorporating textual features into the image encoder via cross-attention is a common practice, the authors go a step further by introducing text guidance into the memory bank through the proposed Text-Tracking Memory Encoder (TTME). This novel design enhances the model’s ability to retain spatiotemporal information in video-like medical data.
2. The authors conduct comprehensive experiments across four different imaging modalities, comparing their method with a broad range of baselines, including end-to-end segmentation models, text-guided segmentation methods, and interactive segmentation based on SAM. In addition, the ablation studies go beyond simple presence-or-absence module testing and delve into architectural design choices within the proposed modules, providing a more thorough evaluation of the method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. There may be potential issues in some experimental details. For example, in the TCVP module, the authors apply cross-attention between text features and image features, using the text feature [L, C] as the query and the image features as key and value. However, the output of cross-attention should match the shape of the query. In this framework, it is unclear how the text feature output can be directly summed with the image feature map, given their likely mismatch in spatial dimensions. No justification is provided for this potential inconsistency, such as shape alignment or broadcasting.
2. The paper introduces three ways to integrate text: through the prompt encoder, TCVP, and TTME. However, the ablation studies and discussion only focus on the effects of TCVP and TTME, while the role of the prompt encoder in contributing to overall performance is not evaluated or discussed.
3. The implementation details of the baseline methods are not elaborated, particularly for SAM-based algorithms. It is unclear whether the original SAM models are used directly or fine-tuned on medical data. Moreover, SAM’s performance under point prompts is highly sensitive to the point selection strategy during both training and inference. Thus, comparing the proposed method directly with point-based interactive segmentation methods may not be entirely fair. It would be more appropriate to also include comparisons with recent video segmentation methods guided by text, such as LoSh.
4. The proposed method adopts a large input image size of 1024×1024, consistent with SAM’s original setup. However, several baselines such as LViT and nn-UNet are typically trained on much smaller input resolutions. The performance gain brought by higher-resolution inputs should be discussed to ensure a fair comparison.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The primary reason for my overall recommendation is the thoroughness of the experiments. The authors evaluate their method across multiple imaging modalities, compare it with a wide range of baseline approaches, and conduct well-designed ablation studies. This comprehensive experimental validation strongly supports the effectiveness and robustness of the proposed method.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper introduces a medical segmentation model, Text-Guided SAM-2 (TGSAM-2), to leverage text prompts as contextual guidance. The main contribution involves a Text-Conditioned Visual Perception (TCVP) module and a Text-Tracking Memory Encoder (TTME); experiments on four medical image datasets demonstrate the superiority of the method.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper enables text-prompted segmentation by incorporating text embeddings into the original SAM framework. By treating text as a form of sparse prompt, the proposed method effectively handles textual input.
- The method enhances segmentation performance by integrating textual information, which provides valuable object attributes. The newly designed TCVP and TTME modules further leverage text input to improve segmentation results.
- The effectiveness and generalization of the proposed method are demonstrated across four datasets of varying modalities (MRI, CT, Ultrasound, and Endoscopy). The authors also conduct comprehensive ablation studies to further explore the effectiveness of the proposed method.
- The code is open-sourced, which ensures the reproducibility of the proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In Section 2.1, the authors should clarify the motivation for treating text embeddings as sparse embeddings. In the original SAM2 framework, both sparse and dense embeddings are utilized, and text embeddings may intuitively seem more akin to dense embeddings. Therefore, the rationale behind this design choice requires further explanation.
- The description of the implementation details is somewhat unclear. Since SAM2 is a prompt-based method, it is recommended to clarify the prompt settings used in the experiments, specifically whether only text prompts are employed or whether they are combined with other types of prompts such as points or bounding boxes. If only text prompts are used, it would be interesting to show the method’s performance when using other prompt types (not necessary). In addition, it is also important to state the prompt types used in the SAM-based methods, since different prompt types may lead to significant performance gaps.
- In the comparison experiments (Table 1), it would be beneficial to include comparisons with text-guided SAM-based methods [1, 2] or SAM2-based methods [3]. This is important because text prompts provide detailed object descriptions (e.g., location, color), going beyond simple referring segmentation. Comparing the proposed method only with segmentation approaches that do not utilize text prompts may be unfair, as the use of additional prior knowledge could give an advantage. Since additional experiments are not allowed during the rebuttal period, the authors can provide a textual explanation first and add the experiment in the final version.
- [1] Li, Y., Zhang, J., Teng, X., Lan, L., Liu, X.: RefSAM: Efficiently adapting segmenting anything model for referring video object segmentation. arXiv preprint arXiv:2307.00997 (2024)
- [2] Wu, J., Jiang, Y., Sun, P., Yuan, Z., Luo, P.: Language as queries for referring video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4974–4984 (2022)
- [3] Wang, H., Yang, G., Zhang, S., Qin, J., Guo, Y., Xu, B., Jin, Y., Zhu, L.: Video instrument synergistic network for referring video instrument segmentation in robotic surgery. IEEE Transactions on Medical Imaging 43(12), 4457–4469 (2024)
- Figure 1 may need to be refined if TGSAM-2 cannot perform segmentation using only a text prompt; the original version of Fig. 1(b) may be misleading.
Minor problems:
- It is recommended to clarify that the performance reported in Table 4 and Table 5 is the mean value over the 4 datasets, to avoid misleading readers.
- Since the original SAM2 is a highly efficient segmentation method, it would be helpful to report the FPS (frames per second) of SAM2, Med-SAM2, and the proposed method to better illustrate the computational cost introduced by the TTME module.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N.A.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed TGSAM-2 method effectively integrates textual information into the SAM2 framework, leading to improved performance in medical image segmentation. Although the proposed TCVP and TTME modules are simple, they significantly enhance the model’s performance. The current experiments repeatedly demonstrate the superiority of the proposed method and the effectiveness of its components, along with corresponding analyses. The main concerns lie in the description of the experimental settings (specifically, the types of prompts used) and the fairness of comparisons with methods based solely on visual prompts. Therefore, this paper could be accepted once these issues are addressed.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ response adequately addressed my concerns. I maintain my original score (4) and recommend acceptance of the paper.
Author Feedback
We thank all reviewers for your valuable feedback. Due to character limits, we have done our best to address all comments.
To R1 [Architectural description & Notation consistency] ① Thank you for your helpful suggestions. We’ll refine our notations (e.g., T, P), figures, and formulas to eliminate ambiguity and ensure clarity. ② As you correctly noted, M_s is passed through a Conv layer to produce m_features, and a positional encoding is used to generate m_pos. Both are stored in the memory bank and connected to memory attention for subsequent frames. We’ll clarify this workflow accordingly. [Experimental details] ① Thank you for pointing out the evaluation of HD and tracking ability. We’ll incorporate them, and we’ll also define Dice and IoU in the revised version as suggested. Currently, our approach focuses on achieving optimal per-frame segmentation, which we believe contributes to overall performance in video settings. ② We apologize for the omission. Based on our prior experiments, the current settings yield the most robust results. The number of memory slots (e.g., 4, 8, 16) is configurable, with the model exhibiting a preference for more recent context, which is typically achieved with 4 slots. ③ The maximum input text length is 512 tokens, as determined by the text encoder. This is sufficient, as the longest prompt in our experiments is 21 tokens.
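(For reference, the overlap-based definitions the authors presumably intend here are the standard ones: for a predicted mask P and ground-truth mask G, Dice(P, G) = 2|P ∩ G| / (|P| + |G|) and IoU(P, G) = |P ∩ G| / |P ∪ G|, typically computed per frame or per slice and then averaged over the sequence.)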
To R2 [Alignment between textual and visual features] As shown in Fig.3(a), textual features serve as queries to guide visual perception and are linearly projected from [L,C] to [HW, C], to match the spatial dimensions of visual features before cross-attention. Thank you for your comment. We’ll refine the corresponding formula. [Experimental details] TGSAM-2 consistently outperforms other methods under identical input resolutions. We report the best results using 1024×1024 resolution, in line with the settings adopted in SAM-2.
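To make the shape alignment described in this response concrete, below is a minimal PyTorch sketch of one possible reading of it; this is not the authors' implementation. The module name TCVPBlockSketch, the choice of a linear projection over the token axis, the attention-head count, and all tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TCVPBlockSketch(nn.Module):
    """Hypothetical sketch (not the paper's code) of the TCVP shape alignment:
    text features [B, L, C] are projected to [B, H*W, C] so the cross-attention
    output can be summed with the flattened visual feature map."""

    def __init__(self, dim: int, num_text_tokens: int, num_visual_tokens: int, num_heads: int = 8):
        super().__init__()
        # Assumed design: a linear layer over the token axis maps L text tokens to H*W tokens.
        self.token_proj = nn.Linear(num_text_tokens, num_visual_tokens)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: [B, L, C]; img_feats: [B, H*W, C]
        queries = self.token_proj(text_feats.transpose(1, 2)).transpose(1, 2)  # [B, H*W, C]
        attended, _ = self.cross_attn(query=queries, key=img_feats, value=img_feats)
        return img_feats + attended  # shapes now match, so the residual sum is well defined


if __name__ == "__main__":
    B, L, C, H, W = 2, 21, 256, 64, 64  # made-up sizes for illustration
    block = TCVPBlockSketch(dim=C, num_text_tokens=L, num_visual_tokens=H * W)
    out = block(torch.randn(B, L, C), torch.randn(B, H * W, C))
    print(out.shape)  # torch.Size([2, 4096, 256])
```

With the query projected to [B, H*W, C] before attention, the output matches the spatial token count of the visual features, which addresses the dimension mismatch raised in Review #2.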
To R3 [Reason for sparse embeddings] We consider only mask prompts to be dense embeddings, as they provide pixel-level representations with inherent spatial information. Text embeddings, while dense vectors, lack spatial structure and instead act as “semantic anchors” for the mask decoder, similar in function to points and boxes. Thus, we treat them as sparse embeddings. [Minor problems] Thank you for your helpful suggestions. We’ll clarify that the results in Tables 4 and 5 are the mean over the 4 datasets.
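A minimal sketch of this design choice, under the same caveat that all names and shapes are illustrative assumptions rather than the paper's code: text tokens are appended to the point/box tokens on the sparse path, while only mask prompts keep their pixel grid on the dense path.

```python
import torch


def assemble_prompt_embeddings(point_emb: torch.Tensor,
                               box_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               mask_emb: torch.Tensor):
    """Hypothetical illustration of treating text embeddings as sparse prompts.

    point_emb [B, Np, C], box_emb [B, Nb, C], and text_emb [B, Nt, C] carry no
    spatial layout, so they are concatenated along the token axis; mask_emb
    [B, C, H, W] keeps its pixel grid and stays on the dense path.
    """
    sparse = torch.cat([point_emb, box_emb, text_emb], dim=1)  # [B, Np+Nb+Nt, C]
    dense = mask_emb                                           # [B, C, H, W]
    return sparse, dense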
To R1,R2 [Baseline and role of prompt encoder] The baseline model includes the image encoder, mask decoder, memory attention, and the original memory encoder from SAM-2, but excludes the prompt encoder. In Tab.3, “without text prompts” refers to this baseline with no textual input. The prompt encoder enables the model to interpret and respond to text prompts. We’ll revise Tab.3 to clearly report performance with and without the prompt encoder as suggested.
To R1,R3 [Prompt setting] This paper focuses on injecting textual semantics into SAM-2 for medical image sequences. Accordingly, we only employ text prompts. Our model also supports visual prompts as this capability is naturally inherited from SAM-2. We believe combining multiple prompt types is a promising future direction. [Computational cost] We appreciate the advice and will include a comparison of inference time in the final version.
To R2,R3 [Experimental fairness] ① SAM-based methods were fine-tuned on medical data. The point selection strategy follows MedSAM-2, and they are specifically prompted every 5 frames (or every frame in Tab.2). In contrast, TGSAM-2 is prompted with a fixed text prompt across all frames, presenting a more challenging yet practical setup. ② We agree that text prompts offer richer contextual information. Accordingly, we have compared with text-guided methods including LViT, LanGuide, and MMI-UNet. Thank you for pointing out the relevance of referring video object segmentation methods (e.g., RefSAM, LoSh). Our method maintains superior performance over them, and we’ll reflect these insights in the revised version.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A