Abstract
Perivascular spaces (PVS), also known as Virchow-Robin spaces, are critical biomarkers for diagnosing cerebral small vessel disease (CSVD). Quantifying PVS visible in magnetic resonance imaging (MRI) is essential for understanding their relationship with various neurological disorders. Traditional methods for assessing PVS rely on visual scoring of MRI images, which is time-consuming, subjective, and unsuitable for large-scale studies. Additionally, due to their small size, scattered distribution, and complex morphology, PVS can easily be confused with neighboring structures, posing significant challenges for their accurate extraction. In this paper, we propose a novel graph interaction-enhanced model based on vision-language modeling (VLM) technology for accurate PVS extraction from MRI. Our approach leverages textual information to guide image feature extraction and employs a graph structure to enhance cross-modal interactions, facilitating reasoning about the relationships between different modalities. Furthermore, we introduce a cross-modal attention mechanism for global feature alignment and an attention-based dynamic fusion module to effectively integrate multi-modal information, improving the accuracy of PVS segmentation. Validated on an independent T1-weighted dataset, our model demonstrates superior performance in capturing both global and local information, addressing the limitations of traditional image-only models and providing a robust solution for PVS segmentation in complex clinical scenarios.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1824_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{CheTao_CrossModal_MICCAI2025,
author = { Chen, Tao and Zhang, Dan and Long, Xi and Breeuwer, Marcel and Zinger, Sveta and Huang, Peiyu and Zhang, Jiong},
title = { { Cross-Modal Graph Learning for Perivascular Spaces Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
pages = {111 -- 121}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a vision-language model with graph representation learning for Perivascular Spaces Segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper improved PV segmentation performance with text cues.
- From the ablation studies, several modules proposed in the paper have improved performance for PVS.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Writing: I suggest the keywords be written with both the abbreviation and the full name.
There are some obvious writing mistakes, e.g., "The proposed PVS segmentation framework". I think the S in PVS already represents segmentation?
The formulas for some methods are incompletely described and show discrepancies with Fig. 2, e.g., the missing softmax in $A_{adj}^{I-T}$.
There are some common-sense mistakes, e.g., "In the language-vision framework, OpenAI's BERT model is used as the text encoder." It is well known that BERT was proposed by Google in 2018: Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
There are some likely conceptual errors, e.g., "The graph structure can effectively model local relationships and topological structures but has limitations in capturing global relationships." In image analysis, once image features are aggregated into a graph embedding via graph projection, the graph convolution operates at the non-local level to capture long-range dependencies of the image features, rather than focusing on local regions as convolutional neural networks do.
The text cues used in the study are not described. Are they taken from radiology reports, generated by an LLM, or built from text templates?
- Method: A Transformer can be viewed as a kind of graph interaction operation on a fully connected graph. In the Multimodal Graph Interaction Enhancement Module, why not replace the GCN with a Transformer, or simply use cross-attention as in most VLM methods, which would suffice for multi-modal interaction?
The cross-attention mechanism is used both before and after the graph interaction, and, as noted in the conceptual comments above (e.g., both GCN and cross-attention are non-local computations), there appears to be considerable duplication of functionality in the design of these modules. The overall approach looks like a splicing of a couple of previous papers, though it may well be an improvement, and I hope the authors can clearly explain how these designs connect. Wu, Tianyi, et al. "GINet: Graph interaction network for scene parsing." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII. Springer International Publishing, 2020. Wang, Zhaoqing, et al. "CRIS: CLIP-driven referring image segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
In addition, I noticed that the MCA differs slightly from standard cross-attention in the post-attention part. What is the purpose of this change?
- Experiments: There is no comparison with VLM-based medical image segmentation methods, and no ablation of the Multimodal Cross-Attention Module and the Multimodal Graph Interaction Enhancement Module, i.e., "Backbone + VLM + MCA (+ MFF)".
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
PVS segmentation -> PV segmentation
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The writing, proposed method and experiments.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The author responded to some of my concerns and addressed some of my doubts.
Review #2
- Please describe the contribution of the paper
The paper introduces a vision-language modeling (VLM)-based framework with graph interaction modules to improve PVS segmentation in MRI. Key innovations include: 1) Text-guided feature extraction to address small, low-contrast PVS structures. 2) Cross-modal graph interaction to model local relationships between image and text features. 3) Global feature alignment via cross-modal attention and dynamic fusion for multi-modal integration.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Innovation: 1) First use of VLM for PVS segmentation, leveraging textual priors to distinguish PVS from neighboring structures. 2) Graph convolutional networks (GCNs) enhance fine-grained feature interactions, improving boundary detection.
Experimental Rigor: 1) Outperforms baseline methods (e.g., nnUNet, SwinUNet) with a Dice score of 60.42% on a private T1-weighted dataset. 2) Ablation studies validate the impact of each module (VLM, MGIE, MCA, MFF), showing stepwise accuracy improvements.
Clinical Relevance: 1) Designed for 3T MRI, aligning with clinical standards, unlike prior methods relying on 7T data.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) Generalization:
Limited to T1-weighted images; no validation on multi-modal data (e.g., T2/FLAIR), where PVS visibility may differ.
Fixed graph parameters (K = 10 nodes, K-NN = 5 edges) lack sensitivity analysis, raising questions about robustness.
2) Reproducibility:
No public code/data repository; "private dataset" details (e.g., patient demographics, acquisition protocols) are insufficient for external validation.
BERT text encoder details (e.g., pre-training domain, fine-tuning) are unclear, risking semantic bias.
3) Comparative Scope: Omits recent VLM-based medical methods (e.g., MedCLIP, BLIP-2), potentially underestimating the competitive landscape.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Suggestions: 1) Validate on multi-modal datasets to demonstrate generality. 2) Publish code/data and perform parameter sensitivity analyses. 3) Compare with state-of-the-art VLM models in medical imaging.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper’s cross-modal approach addresses a critical clinical need for automated PVS segmentation. However, concerns about data diversity, parameter transparency, and comparative benchmarking require resolution. A “Weak Accept” acknowledges its potential while urging improvements.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The main contribution of the paper lies in its novel methodological approach that combines Vision-Language Modeling (VLM) with Graph Convolutional Networks (GCN) for perivascular space (PVS) segmentation—an integration that has been largely unexplored in this domain. While prior work has applied both traditional and deep learning techniques to PVS segmentation, the incorporation of textual guidance through VLM and the use of graph-based cross-modal interaction introduces a new paradigm. This cross-modal framework enables the model to more effectively identify small, scattered, and morphologically complex PVS structures, addressing key limitations of image-only approaches. Although VLMs and GCNs have been individually applied in other contexts, their tailored synergy for PVS segmentation is both novel and potentially impactful.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Cross-modal integration: By combining image and text through Vision-Language Modeling (VLM) and graph interaction, the method enhances semantic understanding—crucial for detecting small, low-contrast PVS structures.
- Graph-based feature refinement: The use of graph structures allows effective modeling of spatial and relational context between features, helping in distinguishing PVS from surrounding anatomy.
- Advanced attention mechanisms: Cross-attention and dynamic multimodal fusion modules contribute to fine-grained global and local feature alignment, which boosts segmentation accuracy.
- Handles clinical challenges: Specifically addresses real-world segmentation issues like PVS complexity, scattered distribution, and morphological variability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Complexity and reproducibility: The architecture involves multiple advanced components (e.g., graph construction, cross-modal attention, K-NN relationships), making it computationally heavy and potentially hard to reproduce.
- Dataset limitations: The method is validated only on a T1-weighted private dataset. Generalizability to other modalities (e.g., T2-weighted or FLAIR) or unseen clinical environments is not evaluated.
- Lack of ablation study: The paper may not clearly isolate the individual contributions of each module (e.g., MGIE, cross-modal attention), making it harder to understand the impact of each design choice.
- Limited comparison with other VLM-based methods: The novelty of integrating VLM in this domain is strong, but comparison with similar cross-modal approaches in medical imaging is lacking, which could help validate its advantages.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The work presents a meaningful application of vision-language modeling and graph interaction for improved PVS segmentation, which could support clinical decision-making in neurology. While the methodology builds on existing frameworks, its integration and demonstrated performance mark a valuable contribution beyond incremental progress.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for taking the time to review our paper, and we appreciate their positive feedback on the effectiveness of our method (e.g., "… accuracy improvements." by R1; "… improved performance for PVS." by R2) and on our technical novelty (e.g., "… novel and potentially impactful." by R3). Here we provide our point-by-point responses to address their concerns.
Q1: Dataset Usage. (R1/R3) As the first study to apply VLM to PVS segmentation, we chose well-standardized T1-weighted images in order to validate its effectiveness and plan to include multi-modal imaging in future work.
Q2: Parameter sensitivity analysis. (R1) Graph parameters were set to balance representation and efficiency, ensuring sufficient feature coverage while reducing computational cost. We acknowledge the importance of sensitivity analysis and will explore this in future work to enhance robustness.
Q3: BERT Misreference. (R1/R2) We appreciate your feedback. To reduce semantic bias and improve domain alignment, we fine-tuned BERT on our own PVS-specific radiology reports. We will clarify and correct the related details in future revisions. In addition, we will correct the stated source of BERT.
Q4: Comparative Experiments. (R1/R2/R3) As the first baseline to explore VLMs for 3D PVS segmentation, we focused on comparisons with classic 3D models and specialized PVS segmentation methods. We appreciate the reviewer's suggestion and will include comparisons with VLM-based approaches in future work.
Q5: Formulation of the problem. (R2) In our work, PVS refers to the target, Perivascular Spaces, and thus we use the term "PVS segmentation." The text prompts were derived from radiology reports to ensure domain relevance. Additionally, while we applied a softmax to normalize the adjacency matrix (Fig. 2), this was omitted from the equation and will be corrected for clarity.
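For illustration, one plausible form of such a softmax-normalized cross-modal adjacency (the symbols below are assumed for exposition and are not the paper's exact notation) is
$$A_{adj}^{I\text{-}T} = \mathrm{softmax}\big(F_I\, W\, F_T^{\top}\big),$$
where $F_I$ and $F_T$ denote the image and text node features, $W$ is a learnable projection, and the softmax is applied row-wise so that each image node's affinities to the text nodes sum to one.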
Q6: Description of the Concept. (R2) The graph structure is inherently flexible, with its modeling capacity determined by graph construction and information propagation strategies. In our study, the graph is built via local semantic clustering of encoder features, enabling the GCN to focus on local topological structures. We also agree that deeper or non-local GCNs can similarly capture global dependencies.
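As a rough, self-contained sketch of the graph-projection idea described here (the shapes, the soft clustering, and the single GCN step are assumptions for illustration, not the authors' implementation):

    import torch
    import torch.nn.functional as F

    def graph_project_and_propagate(feat, assign_logits, gcn_weight):
        """Project flattened voxel features onto K graph nodes, run one
        graph-convolution step over node affinities, and re-project back.
        All shapes are illustrative placeholders.

        feat:          (B, N, C) flattened encoder features (N voxels, C channels)
        assign_logits: (B, N, K) learned soft assignment of voxels to K graph nodes
        gcn_weight:    (C, C)    weight matrix of the graph convolution
        """
        assign = F.softmax(assign_logits, dim=-1)                 # soft clustering of voxels to nodes
        nodes = assign.transpose(1, 2) @ feat                     # (B, K, C) node features
        adj = F.softmax(nodes @ nodes.transpose(1, 2), dim=-1)    # (B, K, K) softmax-normalized affinities
        nodes = torch.relu(adj @ nodes @ gcn_weight)              # one GCN propagation step
        return assign @ nodes                                     # (B, N, C) re-projected features

    # Toy usage with random tensors (K = 10 nodes, matching the reviewers' description)
    B, N, C, K = 2, 64, 32, 10
    out = graph_project_and_propagate(torch.randn(B, N, C),
                                      torch.randn(B, N, K),
                                      torch.randn(C, C))
    print(out.shape)  # torch.Size([2, 64, 32])

Because the node affinities connect clusters regardless of their spatial distance, this also illustrates the reviewer's point that such a graph convolution is a non-local operation.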
Q7: Comparison between GCN and Transformer. (R2) PVS are small, fine structures within the 3D brain. While Transformers model global dependencies well, they lack local inductive bias, making them prone to overfitting and less effective for fine structures when PVS data are limited. Tokenizing 3D volumes is also computationally costly. GCNs are thus more suitable for this task.
Q8: Module Design and Integration. (R2) Our method is not a simple stacking of components. We use a cross-attention-based VLM framework to align semantic cues with visual features. However, cross-attention lacks awareness of spatial structures such as the adjacency and continuity of PVS. Therefore, we design a graph interaction enhancement module to capture local topological relationships and improve PVS detection, unlike GINet, which lacks subsequent graph-based enhancement. Additionally, our decoder incorporates a multimodal fusion module to better reconstruct fine structural details under textual guidance, which CRIS does not consider.
Q9: Differences from Standard Cross-Attention. (R2) The proposed MCA jointly enhances image and text features for stronger multimodal representation learning. Beyond the standard QKV mechanism, it incorporates concatenation, residuals, and a feed-forward network to improve global understanding and fine-grained coupling, benefiting tasks like PVS segmentation.
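A minimal sketch of what such an augmented cross-attention block could look like (the concatenation, residual, and feed-forward arrangement below is one assumed reading of this description, not the paper's exact design; layer sizes are placeholders):

    import torch
    import torch.nn as nn

    class MultimodalCrossAttention(nn.Module):
        """Cross-attention from image queries to text keys/values, followed by
        concatenation with the original image features, a residual connection,
        and a feed-forward network. Illustrative only."""

        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.merge = nn.Linear(2 * dim, dim)          # fuse attended + original features
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))

        def forward(self, img_feat, txt_feat):
            attended, _ = self.attn(img_feat, txt_feat, txt_feat)        # standard QKV cross-attention
            fused = self.merge(torch.cat([img_feat, attended], dim=-1))  # concatenation
            fused = self.norm1(fused + img_feat)                         # residual connection
            return self.norm2(fused + self.ffn(fused))                   # feed-forward + residual

    # Toy usage: 128 image tokens and 16 text tokens, both of dimension 64
    mca = MultimodalCrossAttention(dim=64)
    out = mca(torch.randn(2, 128, 64), torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 128, 64])

The post-attention concatenation and feed-forward steps are what would distinguish such a block from a plain cross-attention layer, which is the difference the reviewer asked about.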
Q10: Ablation Studies. (R2/R3) In the ablation study, our primary goal was to validate the effectiveness of the VLM framework. To better leverage text for guiding image understanding, we introduced different modules. Therefore, we conducted step-by-step ablation experiments to assess the performance gains of each module under controlled conditions.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers have reached a consensus to accept the paper.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Most of the reviewers’ comments have been adequately addressed. This is an interesting 3D medical segmentation approach using a multimodal graph framework.