
Abstract

Lesion segmentation in breast ultrasound videos plays a crucial role in the early detection and intervention of breast cancer. However, it remains a challenging task due to blurred lesion boundaries, substantial background noise, and significant scale variations of lesions across frames. Existing methods typically rely on selecting preceding frames for rudimentary temporal integration but fail to achieve satisfactory segmentation performance. In this paper, we propose STMFSAM, a novel Spatio-Temporal Memory Filtering SAM network, designed to leverage the powerful feature representation and modeling capabilities of SAM for lesion segmentation in breast ultrasound videos. Specifically, we introduce a memory mechanism that stores and propagates essential spatio-temporal features across frames. To enhance segmentation accuracy, we select three relevant reference frames from the memory bank as dense prompts for SAM, enabling it to retain long-term contextual information and effectively guide the segmentation of subsequent frames. To further mitigate the impact of background noise, we present the Spatio-Temporal Memory Filtering module, which selectively refines the memory content by filtering out irrelevant or noisy information. This ensures that only meaningful and informative features are retained for segmentation. We conduct extensive experiments on the UVBSL200 breast ultrasound video dataset, demonstrating that STMFSAM outperforms existing methods. Additionally, to highlight our model’s generalization capability, we achieve competitive results on two video polyp segmentation datasets. The code is available at https://github.com/tzz-ahu/STMFSAM.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1900_paper.pdf

SharedIt Link: https://rdcu.be/eHwOq

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04937-7_52

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/tzz-ahu/STMFSAM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{TuZhe_SpatialTemporal_MICCAI2025,
        author = {Tu, Zhengzheng and Zong, Liang and Jiang, Bo and Wang, Haowen and Wang, Kunpeng and Zhang, Chaoxue},
        title = {{Spatial-Temporal Memory Filtering SAM for Lesion Segmentation in Breast Ultrasound Videos}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {547--557}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper addresses lesion segmentation in breast ultrasound videos, integrating SAM with spatio-temporal context from a memory bank. The authors claim their approach enhances informative memory features while reducing redundant information and noise. The effectiveness of the proposed module is demonstrated using a large-scale breast ultrasound dataset.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Strong motivation: the paper clearly justifies the use of memory banks to encode and store reference-frame information, effectively improving lesion segmentation.
- Thorough empirical evaluation:
  - Evaluated against multiple competitive image- and video-based baseline models.
  - Demonstrates strong quantitative improvements over all baselines.
  - Consistently achieves high segmentation performance across three datasets.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited Novelty: The proposed “Spatio-temporal Memory Filtering” framework may lack novelty, as similar ideas have been explored in prior work and are becoming established in ultrasound video segmentation. For instance, [1] disentangles spatial and temporal attention while using a memory bank to track lesion movement. [2] introduces an adaptive memory mechanism that stores segmentation history to guide current predictions. [3] learns temporal features in the frequency domain and predicts additional lesion positions to assist segmentation (note: [3] is reported for polyp segmentation in the paper, but not for UVBSL200).
- Backbone Selection: The use of SAM as the backbone is questionable, given that MedSAM has demonstrated superior performance in medical imaging. The paper does not justify this choice, nor does it include MedSAM in the comparisons.
- Unsupported Claims: The paper claims to reduce background noise and filter irrelevant information, but no direct evidence is provided. While improved Dice scores suggest potential benefits, it’s unclear whether the gains are due to noise filtering / irrelevant information suppression or other factors. A more targeted analysis is needed to support these claims.
- Incomplete Evaluation Details: The paper lacks important implementation details, including data splits and hyperparameter tuning strategies. Additionally, it does not report standard deviations across multiple runs, leaving the robustness of the results unclear.
References:
[1] Rethinking Breast Lesion Segmentation in Ultrasound: A New Video Dataset and A Baseline Network
[2] Ultrasound Video Segmentation with Adaptive Temporal Memory
[3] Shifting More Attention to Breast Lesion Segmentation in Ultrasound Videos
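The robustness reporting asked for in the evaluation-details point above can be illustrated with a minimal sketch. It is not taken from the paper: the function names `dice` and `summarize_runs` are hypothetical, the masks are toy values, and treating two empty masks as a perfect match is an assumed convention.

```python
from statistics import mean, stdev


def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def summarize_runs(per_run_scores):
    """Return (mean, sample standard deviation) over per-run scores."""
    if len(per_run_scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(per_run_scores), stdev(per_run_scores)


# Toy masks standing in for predictions from differently seeded runs.
target = [0, 1, 1, 1, 0, 0]
run_predictions = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
scores = [dice(p, target) for p in run_predictions]
avg, sd = summarize_runs(scores)
print(f"Dice: {avg:.3f} +/- {sd:.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a sub-1% gain exceeds run-to-run variation.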
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Equation (1) seems wrong as the encoder E should encode frame F, not I.
- Using F for both Frame and Readout is confusing.
- The writing in Sec. 2.1 is hard to follow (e.g., the ground truth is unexpectedly used for the first frame; why aren’t the ground truths of other frames used?).
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Given the limited novelty, unsupported claims, and unclear experimental details and design choices, I find the paper is not ready to be published at MICCAI. However, I’m willing to raise my score if the weaknesses mentioned above are addressed or justified appropriately.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

While the lack of comprehensive evaluation results and details remains, the authors have clarified my main concern regarding the technical contribution of the method. Overall, the presentation of the paper is good and the approach could be beneficial to other domains.

Review #2

Please describe the contribution of the paper

This article proposes the STMFSAM network for lesion segmentation in breast ultrasound videos. By introducing a memory mechanism to store and propagate spatio-temporal features and using a spatio-temporal memory filtering module to optimize memory content, the model achieves excellent results on relevant datasets and demonstrates good generalization ability.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The combination of the memory mechanism with the SAM model effectively utilizes the temporal information in video sequences and achieves good results. The authors conducted rigorous ablation experiments to verify the effectiveness of each module and method. The overall structure of the article is clear.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

According to the ablation results, the improvement from STMF on the various metrics is very limited, mostly less than 1%. There is also a lack of in-depth discussion of the medical context: although the model performs well technically, the article says little about potential problems and challenges in practical clinical application. Finally, the main experiments use only one breast ultrasound video dataset (UVBSL200), so the diversity of the data may be insufficient.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The experimental design is innovative, but the improvement in the experimental results is not obvious.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

This rebuttal demonstrates a comprehensive and professional response to all reviewer concerns. The authors acknowledge limitations appropriately while proposing solutions, and they maintain scientific rigor. Accept, with mandatory inclusion of the promised revisions (computational analysis, evaluation details, and notation/clarity improvements).

Review #3

Please describe the contribution of the paper

This paper proposes STMFSAM, a novel framework for segmenting lesions in breast ultrasound (BUS) videos by enhancing the Segment Anything Model (SAM) with temporal reasoning capabilities. It is an adaptation of the SAM architecture for video segmentation, particularly targeting the challenges of BUS videos. The authors introduce a memory bank to store and propagate spatio-temporal features across frames, and a novel module designed to refine the features retrieved from the memory bank. It demonstrates decent performance on the UVBSL200 breast ultrasound video dataset compared to various image-based and video-based segmentation methods, including baseline SAM and SAMUS.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) Effectively adapting the SAM model to the sequential nature of video data.
2) The proposed memory mechanism and STMF module directly target known difficulties in BUS video segmentation: temporal inconsistencies, lesion shape changes, and significant background noise.
3) The study achieves state-of-the-art performance on the target BUS dataset (UVBSL200), outperforming prior methods applied on the same dataset.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

1) The paper does not discuss the computational overhead introduced by the memory mechanism and the STMF module.
2) The memory bank capacity is fixed at 8. There is no analysis of how performance changes with different memory sizes, which could be important for videos of varying lengths or complexity.
3) The method requires the ground truth mask for the first frame to initialize the memory value storage. The sensitivity of the model to the quality or absence of this initial mask (e.g., using a predicted mask instead) is not explored, which is relevant for fully automatic scenarios.
4) The ablation study shows that the proposed Reference Frame Selection Algorithm (RFSA) outperforms random selection. However, it doesn’t isolate the contribution of each type of reference frame (first vs. previous vs. similar) to understand their relative importance.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

1) Consider discussing the computational requirements and potential for real-time application.
2) An analysis exploring the sensitivity to memory bank size could strengthen the paper.
3) Investigating the impact of the first-frame ground truth requirement (e.g., using a predicted mask for the first frame) would be valuable for assessing applicability in fully automated pipelines.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper makes a significant contribution by successfully extending the powerful SAM model for temporal analysis in medical video segmentation, a relevant and challenging area.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The justifications provided by the authors in their rebuttal are acceptable in addressing some of the identified drawbacks. Despite remaining limitations, I believe the paper is suitable for acceptance and publication in MICCAI due to its novel adaptation of SAM for breast ultrasound video segmentation.

Author Feedback

We sincerely thank all reviewers and chairs. We also appreciate their recognition of our work’s strengths—like “innovative experimental design” (R1), “effectively adapting the SAM model to the sequential nature of video data” (R2), “strong motivation” and “thorough empirical evaluation” (R3). We will address all questions and concerns point by point below.

STMF improvement is limited(R1): STMF targets key BUS challenges (temporal inconsistencies, noise), offering qualitative improvements and stability (see Fig. 1 in our paper), aligning with R2’s positive remarks on this aspect.

Lack of in-depth discussion on clinical combination(R1): We concur that a detailed exploration of challenges in clinical deployment is important. This involves extensive translational research beyond the scope of a MICCAI methodological paper, and we plan to investigate these aspects in our future work.

Dataset diversity(R1): We clarify that, as R3 also noted (Strength#2c), our method was evaluated on three datasets (UVBSL200, CVC300, CVC612), detailed in Sec. 3.1&3.3 and Table 1&3 of our paper.

No discussion of computational overhead(R2): We will add a brief discussion on computational overhead in the revised manuscript. The current focus was on segmentation accuracy and robustness, where we believe the trade-off is justified for the performance gains in this challenging task.

Memory bank capacity fixed at 8, no sensitivity analysis(R2): The size of 8 was an empirical choice balancing performance and efficiency. A detailed sensitivity analysis is valuable future work.

Requires first-frame GT; sensitivity not explored(R2): This is a common practice in video object segmentation methods to provide a robust initialization for the memory bank. Exploring sensitivity to the quality of this initial mask is a valuable point for future work.

RFSA ablation doesn’t isolate contributions of reference frame types(R2): We agree that a more fine-grained ablation isolating the contribution of each reference frame type would be interesting. We will consider this for future detailed analysis.

Limited Novelty(R3): We appreciate R3 pointing out related works. Unlike Ref1, which uses a memory bank for tracking by disentangling spatial/temporal attention, STMFSAM employs a distinct three-frame selection (first, previous, most similar) from its memory bank to specifically generate dense prompts for SAM. More importantly, distinct from Ref2, which introduces an adaptive memory of segmentation history, our STMF module actively filters the rich spatio-temporal features retrieved from our memory bank (not just prior segmentation masks) before they prompt SAM, providing a more robust contextual input. Ref3 uses frequency-domain temporal features and auxiliary position prediction, whereas we operate directly on spatio-temporal features and focus on refining these retrieved memory features as direct prompts for SAM. Therefore, our primary novelty lies in the specific synergistic integration of SAM with a filtered multi-frame memory prompting mechanism tailored to the challenges of breast ultrasound video segmentation.
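The three-frame selection described above (first, previous, most similar) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `MemoryBank`, its methods, the pinned first frame with first-in-first-out eviction, and cosine similarity over flat feature vectors are all hypothetical choices. Only the capacity of 8 and the three reference types come from the reviews and rebuttal.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


class MemoryBank:
    """Fixed-capacity store of (frame_index, feature) pairs (hypothetical design)."""

    def __init__(self, capacity=8):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.first = None  # assumption: the first frame is never evicted
        self.recent = deque(maxlen=capacity - 1)  # assumption: FIFO eviction

    def add(self, frame_index, feature):
        if self.first is None:
            self.first = (frame_index, feature)
        else:
            self.recent.append((frame_index, feature))

    def select_references(self, query_feature):
        """Pick the first, the previous, and the most similar stored frame."""
        if self.first is None:
            raise ValueError("memory bank is empty")
        entries = [self.first, *self.recent]
        previous = entries[-1]
        most_similar = max(entries, key=lambda e: cosine(e[1], query_feature))
        # The three picks may coincide, e.g. early in a video.
        return {
            "first": self.first[0],
            "previous": previous[0],
            "most_similar": most_similar[0],
        }


bank = MemoryBank(capacity=8)
for index, feature in enumerate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]):
    bank.add(index, feature)
print(bank.select_references([0.0, 1.0]))
```

In the paper the selected frames serve as dense prompts for SAM after filtering; the sketch stops at selection and does not model the STMF module.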

Backbone Selection(R3): We chose SAM to explore adapting a general foundational model. A MedSAM comparison is valuable future work.

Unsupported Claims(R3): STMF’s design (Sec. 2.2) inherently aims to filter noisy features. Qualitative examples in Fig. 3 of our paper support its robustness. We will ensure claims precisely match the evidence presented and add the targeted analysis.

Incomplete Evaluation Details(R3): We will add data splits and hyperparameter settings to Section 3. We acknowledge standard deviations were not reported (experiments used fixed seeds for reproducibility).

Equation (1) error(R3): Thank you. This will be corrected in the final version.

‘F’ for Frame and Readout is confusing(R3): We agree and will revise the notation for clarity.

Sec 2.1 hard to follow(R3): Sec 2.1 will be revised for improved clarity.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

I have read the manuscript, the review comments, and the rebuttal letter. All reviewers recommend acceptance (after rebuttal). This meta-reviewer believes that the authors did a good job in addressing the concerns.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Spatial-Temporal Memory Filtering SAM for Lesion Segmentation in Breast Ultrasound Videos

Author(s):