Abstract

Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies the reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real-time at 61.2 FPS. Our code and datasets are available at https://github.com/jinlab-imvr/ReSurgSAM2.
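As a rough illustration of what a "credible and diverse" memory bank could look like, the sketch below admits a frame only when its confidence is high enough and its feature is not too similar to anything already stored. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `MemoryBank`, the thresholds `min_confidence` and `max_similarity`, the cosine-similarity test, and the first-in-first-out eviction are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryBank:
    """Hypothetical memory bank: keeps confident, mutually dissimilar frames."""
    capacity: int = 4
    min_confidence: float = 0.8  # assumed credibility threshold
    max_similarity: float = 0.9  # assumed diversity threshold
    entries: list = field(default_factory=list)  # (frame_idx, feature, confidence)

    def maybe_add(self, frame_idx, feature, confidence):
        """Store the frame if it is credible and diverse; return True if stored."""
        # Credibility check: skip low-confidence predictions.
        if confidence < self.min_confidence:
            return False
        # Diversity check: skip frames too similar to a stored one.
        if any(cosine(feature, f) > self.max_similarity for _, f, _ in self.entries):
            return False
        self.entries.append((frame_idx, feature, confidence))
        # Assumed eviction policy: drop the oldest entry when over capacity.
        if len(self.entries) > self.capacity:
            self.entries.pop(0)
        return True
```

With `MemoryBank(capacity=2)`, a near-duplicate of a stored feature or a low-confidence frame is rejected, while a confident frame with a clearly different feature is stored and may push out the oldest entry.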

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0617_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0617_supp.zip

Link to the Code Repository

https://github.com/jinlab-imvr/ReSurgSAM2

Link to the Dataset(s)

https://github.com/jinlab-imvr/ReSurgSAM2/tree/main/datasets

BibTex

@InProceedings{LiuHao_ReSurgSAM2_MICCAI2025,
        author = { Liu, Haofeng and Gao, Mingqi and Luo, Xuxiao and Wang, Ziyue and Qin, Guanyi and Wu, Junde and Jin, Yueming},
        title = { { ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {434--444}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a two-stage surgical segmentation strategy based on the SAM2 framework, incorporating an initial frame selection mechanism to enhance the performance of the subsequent tracking method. With the optimised automatic initial frame selection strategy, the authors argue that the proposed method is more robust than the original SAM2 framework for surgical scene segmentation and subsequent tracking, minimising errors due to a wrong initial frame selection.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposes an effective strategy for selecting the initial frame for segmentation based on a referring text prompt, which is particularly relevant for the surgical segmentation domain. When interacting with surgeons, initiating tracking from a frame with poor image quality or instrument occlusion, following the reception of a text prompt, can degrade subsequent performance. The proposed strategy addresses these limitations by ensuring a more suitable starting frame, thereby improving the overall performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the proposed methodology is promising, the experimental evaluation should be more exhaustive. At present, it is unclear how robust the method is to different initializations, specifically, how many frames are evaluated following the reception of the prompt. Additionally, the paper does not address how the method handles changes in the segmentation target. For instance, if the surgeon wishes to include additional structures or tools after the process has already been initialized, it is unclear how the system adapts to this scenario.

    In the discussion section, the authors should clearly outline the limitations of the current approach and suggest directions for future work.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed methodology is promising and has the potential to enhance the adaptability of SAM2-based frameworks for surgical tool segmentation, where robust tracking, and thus overall performance, relies heavily on accurate initial frame segmentation. However, additional experiments and a more detailed description are needed to better assess the method’s applicability and robustness, particularly in scenarios where the prompt is issued under challenging conditions.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of the paper is the proposed two-stage framework for referring video object segmentation (RVOS) in surgical videos that combines text-referred detection and long-term tracking. Its main contributions include the introduction of a cross-modal spatial-temporal Mamba (CSTMamba) for effective fusion of visual and textual information, a credible initial frame selection (CIFS) strategy to ensure robust tracking initialization, and a diversity-driven long-term memory (DLM) mechanism that maintains high-confidence and diverse memory frames to support consistent tracking over long surgical procedures.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The proposed framework achieves substantial gains in accuracy (J&F scores) across multiple surgical video benchmarks compared to both offline and online state-of-the-art methods. 2) The method maintains real-time processing at 61.2 FPS, making it suitable for clinical deployment. 3) The paper includes extensive quantitative comparisons and ablation studies, demonstrating the contribution of each proposed component.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The paper does not provide a comprehensive description of how the proposed model is trained, and a reader cannot fully reproduce the training process without references. 2) The paper lacks a detailed analysis of failure cases or qualitative errors, which limits the understanding of the model’s robustness and practical limitations.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors contributing to this decision are its strong empirical performance, real-time capability, and a well-motivated two-stage design tailored for the challenging task of referring video object segmentation (RVOS) in surgical settings. However, the paper has weaknesses that should be addressed to improve clarity and reproducibility. Despite these concerns, the paper makes a valuable contribution to surgical vision by enhancing both the usability and performance of referring segmentation systems.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a novel approach to efficient online referring video object segmentation (RVOS) in surgical procedures, leveraging the SAM2 framework. The proposed method extends SAM2 for surgical long video segmentation, alleviates ambiguities in the initialization stage, and demonstrates accurate and efficient performance on real surgical datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The limitations associated with naively adapting SAM2 for RVOS of very long surgical videos are clearly formulated and convincingly addressed using the proposed CIFS and DLM designs.

    • The pipeline, which integrates the proposed CSTMamba module, achieves accurate, fast, and online text-based real-time segmentation for surgical videos, potentially offering significant clinical impact.

    • The authors have conducted detailed experiments and studies to demonstrate the effectiveness of the proposed method and its individual modules.

    • The provided video convincingly demonstrates the improved accuracy over the baseline method and its application.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    • The authors refined the annotations of the Ref-EndoVis17 and Ref-EndoVis18 datasets instead of using the RSVIS version. I am curious about the performance on the RSVIS annotations, as it may demonstrate the robustness of the CLIP-based backbone against imperfect annotations. Also, will the re-annotated datasets be released to ensure reproducibility?

    • There is a notable absence of SAM2 baselines in the benchmark. Incorporating SAM2 baselines—such as by combining https://github.com/IDEA-Research/GroundingDINO.git with SAM2—could better highlight the improvements over the limitations of SAM2 discussed in the introduction. While it may be challenging to isolate the performance gains from adapting SAM2 checkpoints alone, the overall selection of baseline methods is still somewhat reasonable.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    • The merits and distinctions of using RVOS over visual prompts could be better discussed and motivated in the introduction section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is a good paper that presents an online, accurate, and efficient approach for referring video object segmentation (RVOS) in long surgical videos, with the potential for significant clinical impact. However, to ensure reproducibility upon acceptance, the re-annotated dataset should be made publicly available.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank the reviewers for their valuable feedback and thoughtful suggestions. We are pleased that all reviewers recognized the potential clinical impact and methodological contribution of our work. Below, we address the specific concerns raised:

Response to R1

  1. Initial frame evaluation following prompt reception: Our method evaluates frames comprehensively throughout the entire sequence after receiving a text prompt. Specifically, we begin evaluation from the first frame and continue until the last frame of the sequence, which covers 149 frames for Ref-EndoVis18 and 300 frames for Ref-EndoVis17. This thorough evaluation ensures robust performance across the full duration of surgical videos.
  2. Adaptation to changes in segmentation targets: Our algorithm is designed to handle changes in segmentation targets and can simultaneously track multiple objects. When tracking multiple targets, our method efficiently shares computation by using the same image encoder features while employing independent decoders for each object. This architecture allows for efficient multi-object tracking while maintaining performance for each individual target.
  3. Limitations and future work: We appreciate the suggestion to include limitations and future directions. The current approach relies on the accuracy of text-referred object identification in the detection stage. In future work, we aim to enhance both the accuracy and efficiency of the text-referring mechanism.
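The shared-encoder, per-object-decoder arrangement described in point 2 can be sketched as follows. This is an illustrative sketch only, assuming the encoder and decoders are plain callables: the class name `SharedEncoderTracker`, its methods, and the `decoder_factory` argument are hypothetical and do not come from the paper or from the SAM2 codebase.

```python
class SharedEncoderTracker:
    """Hypothetical multi-target tracker: one encoder pass, one decoder per target."""

    def __init__(self, encoder, decoder_factory):
        self.encoder = encoder                  # callable: frame -> features
        self.decoder_factory = decoder_factory  # callable: prompt -> decoder
        self.decoders = {}                      # prompt -> independent decoder

    def add_target(self, prompt):
        """Register a new target at any time, e.g. when the surgeon adds a tool."""
        self.decoders[prompt] = self.decoder_factory(prompt)

    def step(self, frame):
        """Encode the frame once, then decode separately for every target."""
        features = self.encoder(frame)  # computed once and shared by all targets
        return {prompt: decode(features) for prompt, decode in self.decoders.items()}
```

In this sketch the per-frame encoder cost stays constant as targets are added, and only the decoding cost grows with the number of tracked objects.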

Response to R2

We thank the reviewer for highlighting the strengths of our method, particularly regarding accuracy improvements and real-time processing capabilities. We will release all the datasets, source code, and model weights for reproduction purposes, and provide more comprehensive implementation details in the camera-ready version.

Response to R3

We appreciate the positive assessment of our approach. Regarding the dataset annotations, we will release our refined annotations for Ref-EndoVis17 and Ref-EndoVis18 to ensure reproducibility. We also thank the reviewer for providing valuable suggestions regarding SAM2 baselines. Integrating GroundingDINO with SAM2 would require significant architectural modifications and retraining, as they weren’t originally designed for the referring video object segmentation task in surgical settings. Our present baseline selection already includes strong comparative methods that cover the spectrum of both offline and online approaches. However, we acknowledge the value of this suggestion and will consider exploring such integration in future extended work beyond the scope of this paper.

General Response

We will release all datasets, source code, and model weights upon publication to ensure full reproducibility of our work. We believe this will contribute significantly to advancing research in surgical video segmentation and tracking.

Thank you once again to the reviewers for their insightful and constructive feedback, which will significantly enhance the quality and clarity of the final version of our paper.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    If the paper is accepted, the final version should address all reviewer comments and concerns, particularly clarity and reproducibility.


