Abstract

Digital Subtraction Angiography (DSA) sequences are the gold standard for diagnosing most cerebrovascular diseases (CVDs). Rapid and accurate recognition of CVDs in DSA sequences helps clinicians make the right decisions, which is important in clinical practice. However, the pathological characteristics of CVDs are numerous and complex, and the spatiotemporal complexity of DSA sequences is high, making the diagnosis of CVDs challenging. Therefore, in this paper, we propose CLIP-DSA, a novel CVD classification framework based on CLIP, a pre-trained vision-language model. We aim to utilize textual knowledge to guide the robust classification of common CVDs in multi-view DSA sequences. Specifically, our CLIP-DSA comprises a dual-branch vision encoder and a text encoder. The vision encoder extracts features from multi-view sequences, while the text encoder obtains textual knowledge. To optimally harness the temporal information in DSA sequences, we introduce a temporal pooling module that dynamically compresses image features along the time dimension. Additionally, we design a multi-view contrastive loss to enhance the network's image-text representation ability by constraining the image features between the two views. On a large dataset of 2,026 patients, the proposed CLIP-DSA achieved an AUC of 90.8% in CVD classification. The code is available at https://github.com/***/.
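
The temporal pooling module is described only at a high level above; the following is a minimal, hypothetical sketch (in PyTorch) of one way such attention-weighted compression along the time dimension could be implemented. The module name, dimensions, and weighting scheme are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn


class TemporalPooling(nn.Module):
    """Collapses a sequence of frame features into a single vector by
    learning a per-frame importance weight (illustrative sketch only)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one scalar score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-level features from the vision encoder.
        weights = torch.softmax(self.score(x), dim=1)   # (batch, time, 1)
        return (weights * x).sum(dim=1)                 # (batch, feat_dim)


# Example: compress 20 frames of 512-d features into one 512-d vector.
pooled = TemporalPooling(512)(torch.randn(2, 20, 512))
print(pooled.shape)  # torch.Size([2, 512])
```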

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2126_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{XieQih_CLIPDSA_MICCAI2025,
        author = { Xie, Qihang and Zhang, Dan and Liu, Mengting and Zhang, Jianwei and Su, Ruisheng and Shan, Caifeng and Zhang, Jiong},
        title = { { CLIP-DSA: Textual Knowledge-Guided Cerebrovascular Diseases Recognition in Multi-View Digital Subtraction Angiography } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        page = {68 -- 77}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes CLIP-DSA, a framework that leverages a pre-trained vision-language model (CLIP) for classifying cerebrovascular diseases (CVDs) in multi-view digital subtraction angiography (DSA) sequences. The approach includes a dual-branch vision encoder for the anterior-posterior (AP) and lateral (LA) views, a frozen text encoder for extracting textual knowledge, a temporal pooling module for compressing spatiotemporal features, and a multi-view contrastive loss that aligns both AP and LA features. The method is validated on a large DSA dataset (2,026 patients), showing improved classification performance across multiple disease categories.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Leveraging a pre-trained CLIP model in a medical scenario is a good solution, and it can help bring in extra information from language to improve classification.
    2. Many DSA studies only look at a single view, but this paper processes both AP and LA views and then uses a contrastive loss to better align and combine these perspectives.
    3. The proposed TPM module is interesting. TPM tries to highlight the most valuable frames in each sequence. This is helpful for DSA data, where certain frames may show disease signs more clearly than others.
    4. The experimental results are impressive. The paper tests on a large real-world dataset of 2,026 patients, covering multiple diseases. Compared to over 10 baselines, including 2D CNNs, 3D CNNs, RNNs, and other vision-language systems, the proposed method shows better performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I have two major concerns:

    1. There is insufficient description of how text prompts and the text encoder in CLIP are handled. For specialized medical data, relying on CLIP’s general-purpose feature extraction might result in missing or insufficiently detailed information. The paper should provide a more detailed explanation of this aspect of the approach.

    2. Regarding the contrastive learning component, it is unclear whether forcing similarity across the two views is appropriate. Since each view captures the vasculature from a different angle, strictly aligning features from both views could lead to overfitting or limit the ability of the multi-view model to learn distinct, complementary representations.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1. Supplement Grad-CAM results with ground-truth lesion annotation or radiologist-marked regions, quantitatively verifying whether the model is truly highlighting pathological areas.
    2. Include more thorough experiments or discussion about the runtime and memory cost of the approach (particularly the dual-branch design, the TPM module, and the multi-view contrastive loss) as the number of frames or views increases.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, CLIP-DSA represents a promising approach that unites textual knowledge and multi-view spatiotemporal data to classify cerebrovascular diseases. The results are encouraging, particularly the strong AUC gain over other baselines. However, several concerns remain and need to be discussed further. For now, I recommend deferring the acceptance decision until after the rebuttal, and I am willing to raise my rating if the authors thoroughly address the concerns I have outlined.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The author feedback addressed my concerns, and I am willing to raise my rating and recommend accepting this paper.



Review #2

  • Please describe the contribution of the paper

    This paper proposes CLIP-DSA, which leverages textual knowledge to guide the classification of common CVDs in bi-plane DSA sequences. It introduces a Temporal Pooling Module (TPM) to integrate the temporal correlations among sequential frames in DSA sequences, and it further employs a multi-view contrastive loss to improve image-text representation by aligning features across views. This is a decent application paper with some novelty.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper proposes CLIP-DSA to guide the robust classification of CVDs in bi-plane DSA sequences.
    2. This paper proposes a Temporal Pooling Module (TPM). This module dynamically compresses image features in time dimension to better integrate the temporal correlations among sequential frames in DSA sequences.
    3. This paper proposes a multi-view contrastive loss to improve image-text representation by aligning features across views.
    4. Finally, it achieves promising results on CVD classification compared to baselines.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I list my issues as follows.

    1. I think the data used in this paper should not be called “multi-view”; it is just bi-plane DSA. You may refer to 4DRGS (https://arxiv.org/abs/2412.12919), which uses multi-view DSA, where the C-arm gantry rotates during the scanning process. The bi-plane setting, in contrast, uses a fixed-position gantry that captures an anterior-posterior view and a lateral view, as in your case.
    2. In Section 2.3, there is a typo: “the single-view features S^{la}_{en} and S^{la}_{en}” should be “the single-view features S^{ap}_{en} and S^{la}_{en}”.
    3. In your results, you only provide MV results. But I think AP and LA results could also be evaluated with their own branches, right?
    4. In Section 3.2, “these methods also incorporated a dual-branch image encoder with shared weights, along with three classification heads corresponding to the AP view, LA view, and multi-view, similar to CLIP-DSA”: for multi-view, would the features from the AP/LA views be processed through TPM? Would the features be concatenated? This is not clear.
    5. The terms used in Table 2 are not appropriate. “Methods” should be “Components”, and “TPM” should be “Pooling Module”.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes CLIP-DSA for CVD classification in bi-plane DSA sequences, with a temporal pooling module and a multi-view contrastive loss, and achieves good results. Overall, this is a decent application paper with some novelty. Some points are not entirely clear, as listed in the weaknesses section.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my concerns in the rebuttal.



Review #3

  • Please describe the contribution of the paper

    The paper presents CLIP-DSA, an application of the CLIP model with a temporal pooling module for cerebrovascular disease classification in digital subtraction angiography. It proposes a novel training methodology that uses a cross-entropy loss and multi-view text-image similarity computation to learn features.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Excellent flow of paper. Easy to read and understand methodology in text and figures.
    • Good methodology that utilizes all data available to train the model effectively
    • Improved results on the test split employed across multiple baseline comparisons and ablations.
    • Helpful Grad-CAM visualizations
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Results are shown on only one dataset, and the experiments are not cross-validated, so the statistical significance of the improvement cannot be judged; the model may be overfit to this particular split.
    • Some baselines were not retrained on the dataset used. Hence, the proposed method possibly had an unfair advantage of knowing the data distribution.
    • Unclear how the final output class was predicted from final embeddings or SIM matrices. See detailed comments.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Multiview contrastive loss:

    • Considering that SIM_{t2v} is b×b while the one-hot encoded target Y is b×5, how is the cross-entropy loss defined? Is Y different from what I understand it to be?
    • How do you get the final predicted class (out of the 5 classes) from the learnt embeddings? Is it the maximum of the correlations with all 5 sentences? Or (which is likely not the case) is the text data also fed as an input during testing to compute SIM? This must be clarified in the paper itself.

    • It is unclear what the negative samples are for L_mc. If I understand it correctly, they are all the other samples in the batch. If that is correct, does L_mc encourage the corresponding SIM matrix to be diagonal?

    Results:

    • Many of the rows of the confusion matrices don’t add up to one. Maybe a rounding error?
    • Would be good to specify which of the baselines were retrained and which weren’t.

    Misc:

    • The statement that redundant and complex data input reduces model performance requires a reference to support it.
    • Would be good to include data demographics and class distribution in the final version.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a good methodology. The lack of experiments on multiple datasets and of cross-validated experiments is the main drawback. Some clarification of the methodology is also required.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    All concerns were clarified. While cross-validated experiments and/or a multi-centre dataset would be helpful in judging generalizability, the reported metrics show consistent improvement over many existing methods.




Author Feedback

We appreciate your positive feedback on our technical novelties (e.g., “good solution” by R1, “decent” and “kind novelty” by R2, and “novel training methodology” and “excellent flow” by R3), and the effectiveness (e.g., “results are impressive” by R1, “promising results” by R2, and “improved results” by R3).

To R1:

  1. Text prompts: As shown in Fig. 2, we construct text prompts in the same way as CLIP and feed them into the CLIP text encoder. Despite lacking detail, these prompts are simple and effective, enabling the model to quickly associate textual cues with CVDs. While using full diagnostic reports (DR) with a trainable text encoder could potentially offer richer clinical context, the variations across reports written by different radiologists pose a challenge for extracting consistent pathological descriptions. Effectively processing such complex and diverse text remains an open research problem. As part of our future work, we aim to develop a DSA-based question-answering model for CVDs that can leverage information from DR.
  2. L_mc: The aim of forcing feature similarity across the two views is to enhance the extraction of their shared pathological features. The experimental results in the 4th and last rows of Table 2 also demonstrate its effectiveness. Additionally, the contrastive learning between each view and the text is designed to highlight their unique characteristics, thereby achieving complementarity. (A generic sketch of such a view-alignment loss is given after this list.)
  3. a) Clinicians have confirmed the Grad-CAM results, and CLIP-DSA focuses more on lesion areas than CLIP. We appreciate your suggestion. b) Discussing the runtime and memory cost of these components is meaningful, and we will include such comparisons in our future work. Thanks for the suggestion.
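
A minimal sketch of a view-alignment contrastive loss of the kind described in point 2 above, assuming an InfoNCE-style formulation with in-batch negatives; the exact form of L_mc in the paper may differ, and the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F


def multiview_contrastive_loss(f_ap: torch.Tensor, f_la: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment of AP and LA features within a batch.

    f_ap, f_la: (batch, dim) view-level features; matching rows come from the
    same patient and act as positives, all other rows in the batch act as
    negatives.
    """
    f_ap = F.normalize(f_ap, dim=-1)
    f_la = F.normalize(f_la, dim=-1)
    logits = f_ap @ f_la.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(f_ap.size(0), device=f_ap.device)
    # Symmetric cross-entropy over both directions (AP->LA and LA->AP),
    # which encourages the similarity matrix to be (near-)diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```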

To R2:

  1. You are right. We will revise our description regarding “multi-view”.
  2. Thanks, we have revised the typo.
  3. You are correct. We can provide results for the AP and LA branches: AP (77.3, 75.3, 85.4); LA (80.4, 79.4, 88.0). However, in Table 1, the AP and LA results of the other methods are based on single-view inputs, which makes a direct comparison unfair. Our method employs L_mc to align features between the AP and LA views, making it unsuitable for single-view training. Therefore, as noted in Section 3.2, we report only the MV results.
  4. Apologies for the confusion. In other comparison methods, features from the two single views are directly concatenated without using TPM, as it’s our proposed module.
  5. Thanks, we have revised the term.

To R3:

  1. Experiments: Collecting multi-center data is challenging, but we are currently acquiring data from another hospital. Conducting full cross-validation (FCV) also demands time and resources. Given these constraints, we use a fixed random seed to select a single validation fold, providing a reasonable and reproducible evaluation. In future work, we plan to include multi-center data and perform FCV to further validate the robustness of our method.
  2. All methods, including ours, use their own pre-trained parameters and are fine-tuned on the DSA dataset, ensuring a fair comparison.
  3. Multi-view contrastive loss: We apologize for the lack of a detailed description here. During training, Y is not actually b×5 but b×b, with each sequence pair corresponding to one sentence (note that sentences can be repeated). The CE loss is computed in the same way as in CLIP. During inference, the image encoder receives a sequence pair, while the text encoder takes all 5 sentences as input. Then, the similarity between the image features and all 5 sentence features is calculated, and the class with the highest similarity is selected as the final classification result. (A code sketch of this training loss and inference step is given after this list.)
  4. Your understanding of L_mc is correct; it encourages the SIM matrix to be diagonal.
  5. The rows of the confusion matrix indeed do not sum to 1, due to rounding errors.
  6. As shown in (10.1109/TMI.2025.3540886), increasing the frame number can degrade performance. We will include the data details in the final version as space permits.
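
A minimal sketch of the CLIP-style training objective and the max-similarity inference described in point 3 above; the temperature value, the names, and the treatment of repeated prompts are simplifying assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F


def clip_style_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    # img_feat: (b, d) pooled multi-view sequence features; txt_feat: (b, d)
    # features of the sentence paired with each sequence. The b x b similarity
    # matrix plays the role of SIM_{t2v}; the target is the matched column for
    # each row, as in CLIP. When prompts repeat within a batch, more than one
    # column is semantically correct; this sketch keeps the standard one-hot
    # target for simplicity.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    sim = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)


def predict_class(img_feat: torch.Tensor, class_txt_feat: torch.Tensor) -> torch.Tensor:
    # class_txt_feat: (5, d) text features of the 5 class prompts. The predicted
    # class is the prompt with the highest cosine similarity to the image features.
    sim = F.normalize(img_feat, dim=-1) @ F.normalize(class_txt_feat, dim=-1).t()
    return sim.argmax(dim=-1)
```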




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Authors should focus on addressing critical comments from the reviewers. However, in the rebuttal stage, authors should refrain from providing additional experimental results.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    N/A

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


