Abstract

Digital Subtraction Angiography (DSA) is the gold standard for vascular disease imaging, but its dynamic frame changes pose challenges: early frames often lack detail in small vessels, while late frames may obscure vessels visible in earlier phases, necessitating time-consuming expert interpretation. Existing methods primarily focus on single-frame analysis or basic temporal integration, treating all frames uniformly and failing to exploit complementary inter-frame information. Furthermore, pre-trained models such as the Segment Anything Model (SAM), while effective for general medical video segmentation, fall short in handling the unique, contrast-agent-driven dynamics of DSA sequences. To overcome these limitations, we introduce TemSAM, a novel temporal-aware segment anything model for cerebrovascular segmentation in DSA sequences. TemSAM integrates two main components: (1) a multi-level Minimum Intensity Projection (MIP) global prompt that enhances temporal representation through a MIP-guided Global Attention (MGA) module, which exploits the global information provided by MIP, and (2) a complementary information fusion module, comprising a frame selection module and a Masked Cross-Temporal Attention module, which extracts additional foreground information from complementary frames. Experimental results demonstrate that TemSAM significantly outperforms existing methods. Our code is available at https://github.com/zhang-liang-hust/TemSAM.
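For readers unfamiliar with the projection underlying the global prompt, here is a minimal NumPy sketch of a minimum intensity projection over a DSA sequence. The function name and the toy data are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def min_intensity_projection(sequence: np.ndarray) -> np.ndarray:
    """Collapse a DSA sequence of shape (T, H, W) into one 2D image by
    taking the minimum intensity at each pixel across all frames.
    In DSA, contrast-filled vessels appear dark, so the temporal minimum
    retains every vessel that becomes visible in any frame."""
    return sequence.min(axis=0)

# Toy example: two "vessels" (low intensities) appear in different frames.
seq = np.full((3, 4, 4), 200, dtype=np.uint8)  # bright background
seq[0, 1, :] = 20   # vessel visible only in frame 0
seq[2, :, 2] = 30   # vessel visible only in frame 2
mip = min_intensity_projection(seq)
print(mip[1, 0], mip[0, 2])  # 20 30 — both vessels survive the projection
```

This is why a MIP image can serve as a global structural prior: it summarizes the full vascular tree even though no single frame shows it completely.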

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2267_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/zhang-liang-hust/TemSAM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhaLia_TemSAM_MICCAI2025,
        author = { Zhang, Liang and Jiang, Xixi and Ding, Xiaohuan and Huang, Zihang and Zhao, Tianyu and Yang, Xin},
        title = { { TemSAM: Temporal-aware Segment Anything Model for Cerebrovascular Segmentation in Digital Subtraction Angiography Sequences } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {612 -- 621}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work introduces a new methodology for vessel segmentation on digital subtraction angiography (DSA), motivated by application-specific insights. It proposes a new architecture based on SAM which combines the temporal information of the sequence with the “global” information from the minimum intensity projection (MinIP).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of complementary information from the most dissimilar frame, motivated by the acquisition properties of the DSA sequence. The novel combination of the dual information streams, including the cross-attention between the temporal and global encoders. The use of acquisition information in designing the architecture is a welcome insight.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Although the paper compares its performance with a variety of other works, it does not include a comparison with method [20], which is more similar to this work than many of the provided SOTA methods. It is unclear why there are no experiments using the DSCA dataset for training (introduced by [20]); instead, all experiments train only on the DIAS dataset and run inference on both. Training on the DSCA dataset would provide a more direct comparison to all the methods that were also tested in [20].

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It is unclear how the SOTA methods were evaluated: what information was used in training (e.g., was the MinIP included, or only the sequence)? It is ambiguous to which methods the last sentence of the implementation details applies: “For a fair comparison, all models are re-implemented and trained for 200 epochs under the same settings.” The paper could benefit from more clarity: the training loss is not mentioned, and there are discrepancies between the terminology used in the paper and Figure 2 (Equation 6 and Figure 2c are contradictory; there is no green bounding box, nor any f_c input, in Figure 2 as the first paragraph of Section 2.1 indicates).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think this paper is insightful and well-motivated; the problem it addresses is likewise relevant. The lack of direct comparison with [20] makes me doubt its superiority claims. It would benefit from more clarity and details on the SOTA comparison.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns regarding the comparison with baseline methods have been addressed.



Review #2

  • Please describe the contribution of the paper

    The paper focuses on cerebrovascular segmentation in digital subtraction angiography (DSA) sequences and proposes a temporal-aware segment anything model (TemSAM). The main contributions include: (1) a multi-level MIP global prompt, utilizing a MIP-guided global attention (MGA) module to integrate global vascular priors from MIP images with local temporal features from DSA clips in a two-branch encoder; (2) a complementary information fusion module, featuring a frame selection module and a masked cross-temporal attention (MCTA) module to aggregate foreground information from complementary frames; (3) experiments including comparisons with SOTA methods, ablation studies, and qualitative visualizations, demonstrating the effectiveness of the proposed method.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Major strengths: (1) The paper is well motivated: the authors highlight that existing methods fail to exploit global structure priors from MIP images and complementary information from specific frames, leading to the development of a tailored solution. (2) The proposed method builds upon SAM and further integrates global (MIP) and local (temporal and complementary frame) information via the MGA and MCTA modules, which are innovative and effectively address the unique challenges of cerebrovascular segmentation. (3) The paper conducts sufficient experiments, including comparisons with diverse SOTA methods, ablation studies, and qualitative analysis, demonstrating the superiority of the proposed method and the importance of each component. (4) The paper is well structured and clearly written, making it easy to understand, and the code is available at an anonymized repository.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Major weaknesses: (1) Ambiguous description of the two-branch encoder. Why do the authors use “Dual-Branch Dencoder (DBD)” in the ablation study? In my understanding, the authors use two encoders to extract features from MIP images and video clips, respectively. (2) How is the dense prompt used in the SAM decoder? SAM generally adopts sparse prompts (e.g., bounding boxes or points). I cannot find the details of how the generated dense prompt is used as guidance to segment cerebrovascular structures. (3) In the two-stage decoder, the refined feature from MCTA is fed into the decoder. Is the same dense prompt used in the stage-2 decoder? In the ablation study, does “without MCTA” mean only using the stage-1 decoder to predict the mask? Moreover, why not update the dense prompt in the stage-2 decoder using the results from the stage-1 decoder? (4) The method demonstrates its effectiveness on two small DSA datasets for cerebrovascular segmentation. Are there large-scale DSA datasets that could further demonstrate TemSAM’s superiority?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is well structured and presents the methodology and experimental results in a reasonably clear manner. The authors first highlight that existing methods fail to exploit global structure priors and complementary temporal information from specific frames, motivating them to introduce two innovative modules, MGA and MCTA. By integrating MGA and MCTA with SAM, the proposed method achieves a significant improvement in vessel segmentation. The experimental results outperform existing task-specific / SAM-based / SAM2-based methods, and the ablation study further validates the effectiveness of each component. Although certain details listed in the major weaknesses require further clarification, I recommend this work as a “Weak Accept”.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my concerns in their rebuttal.



Review #3

  • Please describe the contribution of the paper

    This paper addresses the task of vascular segmentation from DSA sequences by proposing a novel method that combines Minimum Intensity Projection (MIP) with complementary information fusion. The MIP serves as a global cue to enhance temporal awareness, while additional frames are used to extract complementary information. The proposed architecture incorporates multiple attention modules to effectively integrate global and temporal information, achieving superior segmentation performance compared to existing methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of MIP images to provide global context is insightful and beneficial for temporal sequence segmentation. Integrating this into a dynamic framework is a valuable contribution. The method thoughtfully combines various attention mechanisms to capture both the complementary nature of the temporal sequence and the structural information from global cues. This design effectively enhances segmentation accuracy.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The provided URL link in the paper does not work, which hinders the reproducibility of the results.
    2. The description “remaining frames fc (in green bounding boxes)” is unclear, as green bounding boxes are not visible in the corresponding figure. Similarly, the term “current video clip clip_t” is ambiguous: does it refer to a single frame or multiple frames? The presence of two yellow boxes in the figure adds to the confusion.
    3. It is unclear how segmentation performance is computed for each sequence. Is the result averaged over all frames, or is a representative frame selected for evaluation? Clarification is needed.
    4. Comparison against nnU-Net is essential, given its significant impact on recent medical segmentation benchmarks.
    5. The use of MIP images for segmentation has been explored in several prior works [1,2,3]. A more thorough discussion in the related work section is necessary to highlight the differences and improvements over existing approaches.
    6. The experimental section should provide a brief justification for the design and effectiveness of the “Frame Selection” module, as this is critical to supporting the paper’s conclusions.
    7. There are several minor writing and formatting issues that should be addressed: a) The use of italic and regular fonts in equations is inconsistent. b) Label (a) in Figure 2 is oddly placed. c) In Section 2.3, the sentence “Specifically, Given current clip’s mean feature” includes incorrect capitalization. Overall, the manuscript would benefit from careful proofreading before submission.

    References: [1] JointVesselNet: Joint volume-projection convolutional embedding networks for 3D cerebrovascular segmentation (MICCAI 2020) [2] 3D arterial segmentation via single 2D projections and depth supervision in contrast-enhanced CT images (MICCAI 2023) [3] 3D vascular segmentation supervised by 2D annotation of maximum intensity projection (TMI 2024)

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents a methodologically innovative approach by integrating global and temporal information for DSA sequence segmentation, several aspects such as experimental design, clarity in presentation, and completeness of related work discussion require improvement.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ answers addressed my concerns. I consider this work acceptable.




Author Feedback

We sincerely appreciate the reviewers’ efforts in evaluating our work and their acknowledgment of the paper’s contributions, including our innovative method and promising results. We address the key issues below.

To #R2: a) (Q2) Dense prompt. The dense prompt in SAM (i.e., the mask) is embedded using convolutions to align its dimensions with the image embedding; the two are then fused via pixel-wise addition. b) (Q3) Decoder details. (1) The global MIP feature serves as the embedded dense prompt in both stage decoders and is not updated by the stage-1 prediction. (2) “Without MCTA” means using only the stage-1 decoder for prediction. (3) Our experiments show that updating the dense prompt with the stage-1 prediction degrades performance (-0.4% Dice) due to error propagation. This occurs because the dense prompt’s role is to maintain stable global priors of vascular structures, while the low-resolution stage-1 outputs may introduce local noise that weakens this structural guidance. c) (Q4) Datasets. Only these two DSA video datasets are currently available. We hope our work will inspire further research.

To #R3: a) Method [20]. When trained on DIAS, our method achieves +8.5% and +23.8% higher Dice than [20] on DIAS and DSCA, respectively. b) Training on DSCA. We use DSCA to evaluate generalization ability; due to space limitations, the reverse evaluation (training on DSCA and testing on DIAS) was omitted from the main paper. When trained on DSCA, our method outperforms ST-UNet by +4.2% and [20] by +6.7%, since [20]’s naive deep-layer fusion fails to effectively model MIP’s structural guidance. In contrast, our method ensures dynamic feature fusion through multi-scale cross-attention during encoding, while propagating MIP’s rich contextual information via the dense prompt during decoding. c) Setting details. The SOTA methods take only the sequence as input for training. We employ a combination of BCE loss and erosion-based HD loss [TMI2019] with a 1:5 weight ratio.

To #R4: a) (Q4) nnU-Net. Our method achieves +5.5% (DIAS) and +7.2% (DSCA) higher Dice than nnU-Net, benefiting from our multi-level MIP prompts, our advanced semantic fusion module, and effective integration of SAM’s inherent architectural advantages. b) (Q5) MIP-related work. Prior MIP-based works for 3D/video vessel segmentation either treated 2D MIP annotations merely as weak supervision signals (3Dseg-mip-depth 2023), ignoring their spatial structure guidance, or employed simple fusion strategies such as concatenation [JointVesselNet 2020, WSMIP 2024] or deepest-feature fusion [DSANet-TMI 2025], failing to fully model the correlation between temporal/3D features and MIP’s global spatial structure or to enforce vessel geometry constraints. In contrast, our method dynamically aligns MIP’s global structure with temporal features via multi-scale cross-branch attention during encoding, while propagating MIP’s high-level semantics as persistent structural guidance throughout decoding. c) (Q6) Frame Selection Module. Without this module, Dice decreases by 0.4% and clDice by 1.1%, indicating that complementary frames provide essential information enrichment for local clips.

To #R2, #R3, #R4: a) (#R3, #R4-Q2) Figure 2. We apologize for the confusion. To clarify: the sequence comprises frames [F₁…F_n], where clip_t refers to [F_{t−1}, F_t, F_{t+1}] (green boxes), and the other frames are shown in a yellow box. The colors will be corrected. Additionally, “remaining frames f_c” should be “f_r” in Figure 2. b) (#R4-Q3) As detailed in Section 2.1, the final prediction averages all clips’ predictions, so we evaluate the result averaged over all clips. c) (#R2-Q1, #R3, #R4-Q7) The revised paper will correct all identified writing, figure, and formulation issues. d) (#R4-Q1) The inaccessible code link has been fixed. The full code will be released upon paper acceptance.
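The rebuttal's clarifications about clip construction (clip_t = [F_{t−1}, F_t, F_{t+1}]) and the final prediction being an average over all clips can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the 0.5 threshold, and the clamping of edge frames at sequence boundaries are our own choices, not details confirmed by the authors:

```python
import numpy as np

def make_clips(frames):
    """Build clip_t = [F_{t-1}, F_t, F_{t+1}] for each t, per the
    rebuttal's clarification of Figure 2. Boundary handling here is
    an assumption: indices are clamped at the sequence edges."""
    n = len(frames)
    clips = []
    for t in range(n):
        idx = [max(t - 1, 0), t, min(t + 1, n - 1)]
        clips.append([frames[i] for i in idx])
    return clips

def average_predictions(per_clip_masks, threshold=0.5):
    """Average all clips' soft predictions (Section 2.1, per the
    rebuttal), then binarize. The threshold value is an assumption."""
    return (np.mean(per_clip_masks, axis=0) > threshold).astype(np.uint8)

frames = [np.zeros((2, 2)) for _ in range(5)]
clips = make_clips(frames)
print(len(clips), len(clips[0]))  # 5 3

# Three overlapping clips vote on the same 2x2 region.
masks = [np.full((2, 2), p) for p in (0.9, 0.8, 0.2)]
print(average_predictions(masks))
```

Averaging over overlapping clips acts as a simple temporal ensemble: a pixel is kept only if, across the clips that see it, it is predicted as vessel on average.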




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


