Abstract

Surgical video segmentation is a critical task in computer-assisted surgery, essential for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has demonstrated remarkable advancements in both image and video segmentation. However, the inherent limitations of SAM2’s greedy selection memory design are amplified by the unique properties of surgical videos—rapid instrument movement, frequent occlusion, and complex instrument-tissue interaction—resulting in diminished performance in the segmentation of complex, long videos. To address these challenges, we introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy, featuring novel context-aware and occlusion-resilient memory models. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements while maintaining accuracy in segmenting objects throughout videos. Employing a multi-target, single-loop, one-prompt inference further enhances the efficiency of the tracking process in multi-instrument videos. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis2017 and EndoVis2018 datasets, respectively, demonstrating its potential for practical surgical applications. The code is available at https://github.com/Fawke108/MA-SAM2.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2634_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Fawke108/MA-SAM2

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YinMin_MemoryAugmented_MICCAI2025,
        author = { Yin, Ming and Wang, Fu and Ye, Xujiong and Meng, Yanda and Fu, Zeyu},
        title = { { Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {327--336}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce a training-free enhancement of SAM2’s memory management strategy. The proposed method integrates a context-aware memory module to improve contextual understanding over long video sequences, and an occlusion-resilient memory module to mitigate segmentation errors caused by occlusions of tracked objects. This approach is validated on the task of surgical video segmentation, demonstrating consistent performance improvements over the original SAM2 as well as other state-of-the-art tracking models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method enhances SAM2 without requiring retraining or additional fine-tuning. The memory modules can be directly integrated into SAM2, offering a flexible improvement to the base model.
    2. Instead of relying on a sequential memory update, the paper introduces a mask-quality-based memory selection. This leads to more robust and accurate memory updates.
    3. By preventing memory corruption from low-quality masks, the model effectively handles key challenges in surgical video segmentation, such as instrument occlusion and disappearance. It also ensures better temporal consistency and long-term tracking stability over video sequences.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper introduces a “one-prompt” strategy for simultaneous multi-target inference, but its explanation lacks clarity. It is unclear how a single prompt can generate separate segmentation masks for multiple visible instruments. Does it correspond to one prompt mask per category at initialization? Additionally, if inference of multiple objects is done in a single forward pass per frame (instead of per category), this should be explicitly stated and supported with a detailed description or figure.
    2. Equation (2) is hard to understand. The expression M_s (M_s \cap M_a) is ambiguous. Does it denote pixel-wise multiplication or some other operation between the two terms? Furthermore, while the text discusses the calculation of M_f, the equation instead labels the result as M, which is confusing for readers.
    3. In Equation (1), c belongs to C is presented as a generic formulation, and in the text it is written as c = {A,B,C}, where C is a specific category. It would improve clarity to distinguish general formalism from specific examples.
    4. “If the overlap ratio between M_f and M_s falls within a pre-defined threshold range, the current frame is considered to exhibit significant interference.” However, the threshold value, how it is determined, or its impact on results is not specified. Without source code or additional experimental details, this would limit the reproducibility.
    5. The sentence “Our model was implemented using PyTorch and trained and evaluated on an Nvidia RTX3090 16GB GPU” contradicts the paper’s description of MA-SAM2 as training-free. This wording is misleading and should be clarified.
    6. The explanation of the mask selection process in Occlusion-Resilient Memory is overly complex and lacks intuitive guidance. The series of steps would greatly benefit from an example figure for better explanation.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper proposes a fine-tuning-free enhancement to SAM2 for surgical video segmentation through memory-augmented strategies, several important issues limit its readiness for publication at this stage. The core methodological components, the one-prompt multi-target inference and the occlusion-resilient memory mechanism, are insufficiently explained, with key steps described ambiguously. Equation (2), which plays a central role in the method, is mathematically unclear and introduces confusion around mask merging. Additionally, threshold values crucial for mask selection are not defined. While the idea and experiments are promising, the paper needs clearer explanations and more detail to be ready for publication. I encourage the authors to address these concerns in a future revision.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors propose to modify the sequential memory bank of SAM2 with a memory augmentation strategy consisting of an occlusion-resilient memory and a context-aware memory, to better handle occlusion and temporary disappearance of surgical instruments in video object segmentation.

    They demonstrate a significant improvement in the zero-shot performance of SAM2 on surgical videos using the proposed modification.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and easy to follow. The contributions are simple, yet novel. The paper presents sufficient ablation studies to demonstrate the effectiveness of each of the components.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The authors have not promised to release the code post-publication. Since most of the implementation details are missing, it is hard to reproduce the paper without the source code.

    2. The paper does not clarify the prompting strategy. Is there a separate prompt for each class? Is the user prompting strictly once in the first frame, or is the user correcting the prompts wherever SAM2 fails?

    3. The paper claims to demonstrate zero-shot performance. However, since it requires manual annotation of the first frame as the prompt, it cannot be called zero-shot in its true sense. Moreover, the authors do not provide clarity on how prompts are provided for instruments that appear for the first time in the middle of the video.

    4. The authors do not demonstrate fine-tuning results. Given that the original SAM2 was not trained on surgical images, it would be interesting to see the performance of the proposed architecture with fine-tuning (maybe something like PEFT, LoRA, etc.).

    5. The authors do not provide an exhaustive comparison with the state of the art. Architectures such as MedSAM, MedSAM2, Grounding DINO, BioMedParse, etc., could serve as good comparisons.

    6. It would be better if the authors could report the fully supervised SOTA numbers for readers to compare against.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is a good architecture. However, lack of experimental evidences and details about the prompting reduces the acceptability of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a surgical instrument tracking approach, MA-SAM2. The tracking is training-free: it sits on top of SAM2 and enhances it by introducing an occlusion-resilient memory and a context-aware memory to incorporate temporal feature information and ensure temporal consistency.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper uses the SOTA SAM2 and adapts it to surgical instrument tracking. The text is well structured, and the proposed solution is a nice idea. It addresses real-world problems in surgical videos. The results are very promising. Ablation studies provide more insight into the influence of the different components.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I have not found any indication that MA-SAM2 will be made openly available. This might be important to ensure progress for the community. A minor weakness is the lack of discussion of the results. Why are the results for the LND instrument category not that promising?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would have voted for strong accept if the method were made available to the community. I might have missed that?

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their constructive and encouraging comments and careful reading. We are pleased that reviewers found the work to be novel (R1, R2, R3), well-organised (R1, R2), and effective in addressing the challenges of surgical video segmentation (R1, R2, R3). Below, we clarify and respond to the main concerns raised, organized by topic.

Code (R1, R2): We will release the source code and configuration details of MA-SAM2 upon paper acceptance.

Results Analysis (R1): Thank you for the comment. The relatively lower performance on LND may result from frequent overlaps with similar instruments, which make it more susceptible to interference during memory updates. MA-SAM2 relies on confidence and IoU scores to maintain memory quality, which may filter out some LND masks under complex interactions. Despite this, the model achieves stable gains across most categories, and we consider this a reasonable trade-off for overall robustness. We will provide a more detailed discussion in the final submission.

Prompting Strategy and “Zero-shot” Definition (R2, R3): Regarding the term “zero-shot,” our method follows SAM’s definition, which refers to inference without any training or fine-tuning on the target domain. MA-SAM2 adheres to this definition by operating entirely in a training-free manner, relying only on a single manual mask prompt per category at its first appearance. As clarified in Section 1, paragraph 5, MA-SAM2 adopts a mask-based one-prompt strategy for each instrument category, where a single prompt is provided only at the first appearance of the category within the video, rather than in the first frame. No additional manual intervention or corrective prompts are required throughout the remainder of the video. This strategy enables efficient inference across long and complex surgical sequences. We will ensure these points are clarified in the final submission.

Comparison Scope and Baselines (R2): We thank the reviewers for their suggestions. While adapting fine-tuned models for surgical video segmentation is a promising direction, our method is explicitly designed for training-free inference, without relying on LoRA or other PEFT techniques. Accordingly, we selected baselines with the same setting to ensure a fair comparison. The suggested models are designed for static image analysis rather than video segmentation (e.g. MedSAM, MedSAM2, BioMedParse, Grounding DINO). Including such methods would introduce mismatched assumptions. Similarly, fully supervised SOTA models, which depend on large-scale labelled data, operate under fundamentally different constraints. Our current scope is deliberately limited to highlight the advantages of MA-SAM2 under strictly training-free conditions.

Occlusion-Resilient Memory Clarity (R3): Thank you for the suggestion. Briefly, ORM evaluates candidate masks by computing their average IoU, selects the most reliable one, and filters interference through connected-component analysis and overlap constraints. This allows the model to handle multi-object occlusions and retain stable predictions. We will revise the text for clarity, and the released code will include clear documentation and comments to support reproducibility.

Presentation Clarity (R3): Thank you for your feedback. In Eq. 2, the expression is intended to denote pixel-wise multiplication. We acknowledge that the notation in Eq. 1, as well as the phrase “trained and evaluated,” may cause confusion. We will revise the relevant equations and descriptions to improve clarity and consistency.
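For concreteness, the clarified Eq. 2 can be read as follows; this is a hedged reconstruction based on the reviewer's transcription and the clarification above, not a verbatim copy of the paper's equation:

    M_f = M_s \odot (M_s \cap M_a)

where \odot denotes pixel-wise multiplication and \cap the pixel-wise intersection of the two masks, so that M_f keeps only the part of M_s that overlaps M_a.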
Interference Threshold (R3): Thank you for pointing this out. We define significant interference when the overlap ratio is <0.8 or >1.2, indicating substantial mask deviation. This empirically chosen range helps filter unreliable frames during memory updates. We will clarify this threshold and its role in the final version to support reproducibility.
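To illustrate how the interference check and the quality-based memory selection described above could be realised, a minimal Python sketch is given below. It assumes binary NumPy masks; the 0.8/1.2 bounds are taken directly from the rebuttal, while the function names, the mean-IoU selection criterion, and all other details are illustrative assumptions rather than the authors' released implementation.

import numpy as np

def overlap_ratio(m_f: np.ndarray, m_s: np.ndarray) -> float:
    # Area ratio between the fused mask M_f and the single-object mask M_s.
    area_s = m_s.sum()
    return float(m_f.sum()) / float(area_s) if area_s > 0 else 0.0

def has_significant_interference(m_f: np.ndarray, m_s: np.ndarray,
                                 low: float = 0.8, high: float = 1.2) -> bool:
    # Flag the frame as exhibiting significant interference if the overlap
    # ratio falls outside the [low, high] range stated in the rebuttal.
    r = overlap_ratio(m_f, m_s)
    return r < low or r > high

def select_memory_mask(candidates: list[np.ndarray]) -> int:
    # Return the index of the candidate mask with the highest mean IoU against
    # the other candidates, instead of always keeping the most recent
    # prediction as a purely sequential memory update would.
    def mean_iou(i: int) -> float:
        ious = []
        for j, other in enumerate(candidates):
            if j == i:
                continue
            inter = np.logical_and(candidates[i], other).sum()
            union = np.logical_or(candidates[i], other).sum()
            ious.append(inter / union if union > 0 else 0.0)
        return sum(ious) / len(ious) if ious else 0.0
    return max(range(len(candidates)), key=mean_iou)

Under these assumptions, a frame whose prediction fails the interference check would simply be skipped during the memory update, so low-quality masks never corrupt the memory bank.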




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Authors need to clarify critical concerns raised by the reviewers.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    If the paper is accepted, reviewer comments and suggestions should be included in the final version. In particular, the source code should be public, as promised in the rebuttal, to ensure reproducibility.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although the paper received one reject recommendation, based on the other reviews and the rebuttal there is potential and merit in the work, and on this basis I vote for acceptance.


