Abstract
Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing the disease-specific regions crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for Vision CoT. By fine-tuning the vision-language model on the constructed R-Med 39K dataset, V2T-CoT provides definitive medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to establish precise and explainable diagnostic results. Experimental results on four Med-VQA benchmarks demonstrate state-of-the-art performance, with substantial improvements in both accuracy and interpretability.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1207_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1207_supp.zip
Link to the Code Repository
https://github.com/Venn2336/V2T_CoT
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WanYua_V2TCoT_MICCAI2025,
author = { Wang, Yuan and Liu, Jiaxiang and Gao, Shujian and Feng, Bin and Tang, Zhihang and Gai, Xiaotang and Wu, Jian and Liu, Zuozhu},
title = { { V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {661 -- 671}
}
Reviews
Review #1
- Please describe the contribution of the paper
V2T-CoT introduces a multimodal reasoning framework that significantly enhances both the accuracy and interpretability of medical visual question answering tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Experimental results demonstrate that V2T-CoT achieves state-of-the-art performance on four medical visual question answering benchmarks.
- Constructs a Med-VQA dataset (R-Med 39K) for instruction tuning.
- The Vision CoT is capable of locating disease-related visual cues, enhancing the model’s focus on critical regions.
- The Text CoT provides a diagnostic reasoning path, improving accuracy, interpretability, and transparency.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
More details about the R-Med 39K dataset need to be provided.
- How was the expert validation conducted? Was it based on assessing and filtering the generated structured diagnostic rationale, or was it done through a human-in-the-loop approach?
- Was the entire dataset subjected to human evaluation?
- Was the validation performed by medical professionals, and if so, what were their qualifications?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is well-structured, with theoretical analysis aligning with the conclusions, and the experiments are comprehensive.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper proposes V2T-CoT, a novel framework for Medical Visual Question Answering (Med-VQA) designed to enhance both diagnostic accuracy and interpretability. This is achieved by explicitly integrating visual grounding with textual reasoning generation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel Integrated Framework: The core idea of tightly coupling explicit visual region localization (Vision CoT) with the generation of a textual reasoning chain (Text CoT) is a strong conceptual contribution. While components like visual grounding and CoT exist, their direct, synergistic integration for explainable Med-VQA is well-motivated and novel in this context.
- Valuable Dataset Resource (R-Med 39K): The construction and planned release of the R-Med 39K dataset, specifically designed for instruction-tuning Med-VQA models with reasoning paths, is a significant practical contribution to the community. The described generation and validation pipeline (using LLMs and expert checks) seems reasonable for creating a large-scale resource.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- R-Med 39K Generation Details & Quality Control: The rationale generation relies heavily on LLMs (GPT-4/Gemini). While a verification step (LLM + expert) is mentioned, the paper lacks detail on the scale and rigor of the expert validation. What proportion of generated rationales required correction or were discarded? Reliance on LLMs for generation and initial verification might propagate biases or factual errors present in the base LLMs. More transparency on the quality control process would strengthen this contribution.
- Computational Overhead: The proposed framework adds components (phrase grounding detector, potentially more complex attention) compared to standard VQA models. An analysis of the computational cost (e.g., inference time, parameter increase) compared to baselines would be beneficial for understanding practical deployability.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a well-motivated and novel framework (V2T-CoT) that effectively integrates visual grounding with chain-of-thought reasoning for Med-VQA.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper introduces V2T-CoT, a vision-language framework for medical visual question answering that integrates visual region grounding and chain-of-thought reasoning. The method combines automated localization of disease-relevant image regions (Vision CoT) with instruction-tuned rationale generation (Text CoT) to improve both answer accuracy and interpretability. The authors also construct a new instruction dataset (R-Med 39K) to support reasoning supervision and demonstrate performance gains across four Med-VQA benchmarks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed V2T-CoT framework offers a clear design that jointly models visual region localization and step-by-step reasoning, which is well aligned with the need for interpretability in clinical VQA tasks.
- The use of automated phrase grounding and regional attention adds a spatial reasoning layer that distinguishes this method from prior global-feature-based approaches.
- The construction of the R-Med 39K dataset, combining multiple benchmarks with structured rationales, enables supervised training of reasoning chains without requiring full manual annotation.
- Experimental results are consistent and well-validated across four Med-VQA datasets, including both closed and open-ended questions, with ablation studies and rationale quality evaluations that support the method's claims.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper claims to improve region-level understanding, but the evaluation focuses more on detecting general anatomical regions rather than verifying whether the model actually focuses on disease-relevant areas. It's hard to tell if this really solves the "localization" issue raised in the intro.
- The construction of the R-Med 39K dataset is a big part of this work, but the description of how the rationales were verified, especially by human experts, is a bit vague. Since this dataset is core to the training, it'd help to see more details on how quality was ensured.
- The idea of interpretability is central to the paper, but apart from average scores for rationale quality, there isn't much analysis of how the model reasons or where it fails. A few concrete examples would really help illustrate whether the reasoning is actually useful in practice.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper tackles a real challenge in medical VQA and offers a promising direction with a solid method, even though parts of the validation could be stronger.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank reviewers (R1, R3, R5) for their constructive comments and acknowledgment of our novel method in addressing the challenges of Med-VQA tasks.
Q1: R-Med 39K Dataset (R1, R3, R5)
A1: For the construction of the R-Med 39K dataset, we implemented rigorous generation and validation processes to ensure data quality. (1) We first used large language models (GPT-4o and Gemini-2.0-flash-lite) to generate preliminary diagnostic reasoning: given the tuple <Question (Q), Image (I)>, the vision-language model produced an answer (A); then, based on the triplet <Q, I, A>, structured diagnostic rationales (R) were generated through reasoning prompt strategies. (2) To ensure clinical validity, each generated R was evaluated by an independent language model, which assessed its logical consistency with A and I; this yielded verified quadruples <Q, I, A, R> for instruction fine-tuning. (3) We cross-validated a subset of the <Q, I, A, R> tuples through manual expert review. Our human validation protocol comprised:
(i) Using stratified random sampling (by disease type and imaging modality), 20% of the tuples were selected for double-blind scoring by Group 1 (one medical graduate student + one licensed physician) on a 5-point Likert scale across three dimensions: diagnostic accuracy, logical coherence, and clinical utility. Inter-rater reliability (Cohen's Kappa coefficient) was calculated to ensure consistency.
(ii) To mitigate verification bias, the second phase implemented cross-verification sampling: Group 2 (one medical graduate student + one licensed physician) re-evaluated 50% of the validated data through systematic sampling and scored 10% new samples drawn from the unvalidated data using proportional stratified sampling. After weight calibration, a two-sample t-test confirmed no significant inter-group differences (p > 0.05) between verification results.
(iii) Inverse-variance weighted averaging was applied to integrate the validation phases, reducing the impact of sampling error, and bootstrap resampling (1000 iterations) was used to calculate 95% confidence intervals, ensuring statistical robustness.
Our scoring criteria were: 5 = fully supports the answer with an accurate explanation; 4 = supports the answer but lacks details in the explanation; 3 = provides partial support for the answer but is incomplete; 2 = weakly supports the answer with missing information; 1 = does not support the answer (incorrect or irrelevant). Finally, we retained all data entries rather than discarding lower-scoring instances, as the scoring system encompassed both fully supportive rationales and those failing to adequately justify answers.
Q2: Analysis of Methodological Innovation and Interpretability (R3, R5)
A2: We recognize that the analysis of model interpretability in the paper remains insufficiently thorough. During revision, we will therefore supplement more specific case studies and analyses. In addition to the existing average reasoning-quality scores, we will incorporate detailed interpretations of the model's reasoning process to demonstrate how it progressively constructs diagnostic reasoning from visual and textual cues, and we will examine representative successful and failed cases in depth to explore the model's behavior and its underlying causes in these scenarios. Taking the case study in Figure 3B as an example, V2T-CoT precisely localizes critical anatomical structures such as the liver through Vision CoT, illustrating how the model leverages visual cues to generate accurate diagnostic reasoning paths, thereby enhancing both reasoning accuracy and interpretability. Through these supplementary analyses, we aim to demonstrate the model's interpretability more comprehensively and provide clearer insight into its strengths and limitations in practical application.
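For concreteness, the statistical checks named in the rebuttal (Cohen's Kappa for inter-rater reliability, a two-sample t-test between verification groups, and a 1000-iteration bootstrap for 95% confidence intervals) could be reproduced along the lines of the minimal sketch below. The score arrays are synthetic placeholders, not the authors' actual validation data or scripts, and the group sizes are illustrative assumptions.

```python
# Minimal sketch of the R-Med 39K validation statistics described in the rebuttal.
# All scores here are synthetic placeholders; the real rater data are not public.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical double-blind 5-point Likert scores from Group 1 on the sampled tuples.
rater_a = rng.integers(3, 6, size=200)                             # medical graduate student
rater_b = np.clip(rater_a + rng.integers(-1, 2, size=200), 1, 5)   # licensed physician

# Inter-rater reliability between the two Group 1 raters.
kappa = cohen_kappa_score(rater_a, rater_b)

# Two-sample t-test comparing Group 1 and Group 2 mean ratings
# (p > 0.05 would indicate no significant inter-group difference, as reported).
group1_scores = (rater_a + rater_b) / 2
group2_scores = rng.normal(group1_scores.mean(), group1_scores.std(), size=120)
t_stat, p_value = stats.ttest_ind(group1_scores, group2_scores, equal_var=False)

# Bootstrap resampling (1000 iterations) for a 95% CI on the mean rationale score.
boot_means = [rng.choice(group1_scores, size=group1_scores.size, replace=True).mean()
              for _ in range(1000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Cohen's kappa: {kappa:.2f}")
print(f"t-test p-value: {p_value:.3f}")
print(f"Bootstrap 95% CI for mean score: [{ci_low:.2f}, {ci_high:.2f}]")
```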
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A