Abstract
In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding.
To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps.
To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.
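Since the code has not yet been released, the following is a minimal PyTorch sketch of how a joint consistency-and-contrastive objective of this kind could be combined with the standard supervised loss. The KL-based consistency term, the InfoNCE-style contrastive term, and all function names and loss weights are illustrative assumptions, not the authors' CCL implementation.

```python
# Hedged sketch of a joint consistency + contrastive objective
# (illustrative only; not the authors' released CCL implementation).
import torch
import torch.nn.functional as F

def consistency_loss(logits_orig, logits_pert):
    """Encourage matching answer distributions for an original question and
    its semantically equivalent rephrasing (symmetric KL over (B, V) logits)."""
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_pert, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))

def contrastive_loss(emb_orig, emb_pert, temperature=0.07):
    """InfoNCE-style loss: each question embedding should be closest to its
    own rephrasing and far from the other questions in the batch."""
    z1 = F.normalize(emb_orig, dim=-1)
    z2 = F.normalize(emb_pert, dim=-1)
    sims = z1 @ z2.t() / temperature                     # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device) # positives on diagonal
    return F.cross_entropy(sims, targets)

def ccl_total_loss(sft_loss, logits_orig, logits_pert,
                   emb_orig, emb_pert, lam1=1.0, lam2=0.5):
    """Assumed overall objective: standard SFT loss plus the two robustness
    terms with tunable weights lam1 and lam2."""
    return (sft_loss
            + lam1 * consistency_loss(logits_orig, logits_pert)
            + lam2 * contrastive_loss(emb_orig, emb_pert))
```

The intuition behind this kind of design is that the consistency term ties the answer distributions of a question and its rephrasing together, while the contrastive term keeps embeddings of different questions apart, so the model cannot rely on surface phrasing alone.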
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3893_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{JiaSon_Knowing_MICCAI2025,
author = { Jiang, Songtao and Chen, Yuxi and Song, Sibo and Zhang, Yan and Jin, Yeying and Feng, Yang and Wu, Jian and Liu, Zuozhu},
title = { { Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15970},
month = {September},
pages = {330 -- 340}
}
Reviews
Review #1
- Please describe the contribution of the paper
1. The authors construct a comprehensive dataset, RoMed, to address the lack of robustness evaluation in existing Med-VQA systems and to train the proposed CCL method.
2. The authors propose the CCL method and verify its effectiveness through comparative evaluation and ablation experiments.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The research motivation is clear. The paper reveals that medical vision-language models (Med-VLMs) give inconsistent answers to semantically equivalent but differently phrased medical questions, and proposes an effective solution to this problem.
2. The method is simple and effective. Based on the RoMed dataset, model performance is significantly improved by modifying only the training objective, without changing the model architecture.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The description of the RoMed dataset lacks detail: ① why are three different perturbation forms used to construct the perturbed data? The necessity of each is not demonstrated experimentally; ② what is the current proportion of the three perturbation forms, and why was it chosen? ③ how does the proportion of the different perturbation forms affect CCL training?
2. Cross-dataset validation experiments are missing, e.g., on PMC-VQA, Medical-CXR-VQA, GEMeX, and other common VQA datasets; these are necessary to verify the generalizability of the RoMed dataset and the robustness of the CCL method.
3. Beyond SFT methods, a performance comparison with RL methods is missing, even though RL methods increasingly show superior generalization and robustness for Med-VLMs.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The font size of the image/table captions is too small to read.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Consistent with the major weaknesses section.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper proposes and constructs a novel dataset, RoMed, specifically designed for evaluating the robustness of medical visual question answering models, with the aim of more realistically simulating the diverse expressions of patients in clinical settings. In order to enhance the model’s robustness to perturbations without sacrificing traditional accuracy metrics, a novel joint consistency and contrastive learning (CCL) framework is innovatively designed. Experimental results on multiple datasets validate the effectiveness of the proposed approach and its potential for clinical applications.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed CCL framework addresses both knowledge alignment and data bias issues, thereby rendering the overall training objective more comprehensive. In addition, the contrastive learning component effectively suppresses the overfitting tendency of shallow features, enhancing the model’s generalization capability for various phrasings.
- The manuscript presents comprehensive and in-depth experiments that not only demonstrate performance improvements on standard VQA datasets but also include the specifically designed RoMed dataset to assess model robustness. The use of multiple evaluation metrics (Recall, Accuracy, CV, MAD) enables a more thorough and objective reflection of the model's performance in practical application scenarios (an illustrative sketch of these consistency metrics follows this list).
- Driven by practical application needs, the research approach and technical implementation proposed in this manuscript introduce a novel evaluation perspective focused on robustness and consistency, which has promising implications for advancing clinical applications.
- The schematic diagrams and result figures in the manuscript are exceptionally well-designed.
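Since the review lists CV and MAD without definitions, here is a small hedged sketch of one plausible reading of these acronyms: the coefficient of variation and mean absolute deviation of a model's score across semantically equivalent rephrasings. The paper may define the metrics differently.

```python
# Illustrative computation of CV and MAD over per-rephrasing scores
# (assumed definitions; the paper's exact formulas may differ).
import statistics

def consistency_metrics(scores):
    """scores: accuracy (or recall) of the model on each semantically
    equivalent rephrasing of the same question set."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    cv = std / mean if mean else float("nan")             # coefficient of variation
    mad = statistics.mean(abs(s - mean) for s in scores)  # mean absolute deviation
    return cv, mad

# Example: score on the original questions and three perturbed variants.
cv, mad = consistency_metrics([0.81, 0.64, 0.70, 0.58])
print(f"CV={cv:.3f}, MAD={mad:.3f}")  # lower values indicate more consistent answers
```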
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Although the paper focuses on robustness under question perturbations, it does not compare against general-domain VQA models that also address robustness or adversarial question reformulation (e.g., work in VQA-CP or CLEVR-Hans). Including such baselines would help better position the proposed method and demonstrate its distinct advantages.
- While RoMed introduces multiple types of linguistic perturbations, the paper lacks a detailed taxonomy or analysis of their individual effects. It is unclear how the model performs across different categories (e.g., synonym substitution, negation, rephrasing), which could offer valuable diagnostic insight.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Reasons for recommending a score of 4 are as follows. The paper offers a novel and clinically meaningful perspective on robustness in medical visual question answering by addressing answer consistency under semantically equivalent question perturbations—an underexplored yet important issue. The proposed RoMed benchmark is thoughtfully constructed and enables evaluation beyond standard accuracy, while the CCL framework provides an effective dual-branch learning strategy that improves both robustness and performance. Experimental results are strong across multiple datasets and modalities. Although the paper could be strengthened by comparisons to robustness-aware VQA baselines and human evaluation, the core contributions are original and well-supported.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper addresses the fragility of current Medical Visual Question Answering (Med-VQA) systems by identifying that models often provide inconsistent answers when faced with semantically equivalent perturbations.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Combined Learning Objectives: The integration of consistency learning with contrastive learning in a joint framework represents a novel approach. Instead of merely relying on increased data variability or fine-tuning, the method directly tackles model overconfidence and inconsistency.
Comprehensive Benchmarking: The method is thoroughly evaluated on several standard Med-VQA benchmarks (e.g., Rad-VQA, SLAKE, PathVQA) as well as on the newly constructed RoMed dataset.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Generation Process: The construction of RoMed relies on a multi-agent system (including models such as HuatuoGPT-Vision, HuatuoGPT-o1, and GPT-4o) to generate and validate question perturbations. This dependency may limit reproducibility and raises questions about generalization if such agents are not equally available or if their internal biases propagate into the dataset.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper provides a timely contribution by pinpointing a critical limitation in current Med-VQA systems—namely, their vulnerability to question perturbations—and offers both a novel dataset (RoMed) and a robust training framework (CCL) to address the problem. The dual approach of combining consistency learning with contrastive learning is well-motivated and substantiated by comprehensive experimental results, including strong ablation studies that clarify the contribution of each component.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
To Reviewer #1: Thank you for your insightful feedback and acknowledgment!
Q1.1: In fact, the three types of perturbations represent a comprehensive exploration of model robustness from the textual perspective, covering three granularities: from coarse-grained sentence-level and semantic-level perturbations to fine-grained word-level perturbations. As shown in Figure 4(e), our preliminary visualization reveals that the vanilla model lacks robustness across all three types of perturbations—its embeddings fail to capture the shared semantic features across different formulations. Therefore, we incorporate perturbations at all three levels in our experiments to fully assess the model’s robustness in complex clinical scenarios.
Q1.2: Currently, the three perturbation types are included in a 1:1:1 ratio. For each original question in the dataset, we generate corresponding perturbations at all three levels. This allows for a more comprehensive evaluation of model performance under various perturbation granularities.
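As a concrete illustration of the 1:1:1 expansion described above, the following hedged sketch pairs each original QA item with one variant per perturbation level; the perturb_* functions are toy placeholders for the paper's multi-agent generation pipeline, not its actual implementation.

```python
# Hypothetical sketch of a 1:1:1 RoMed-style expansion: each original
# question gets one word-level, one sentence-level, and one semantic-level
# variant, all sharing the same image and ground-truth answer.

def perturb_word_level(q: str) -> str:
    # Toy stand-in: the real pipeline substitutes medical synonyms via LLM agents.
    return q.replace("tumor", "neoplasm")

def perturb_sentence_level(q: str) -> str:
    # Toy stand-in: the real pipeline rewrites the sentence structure.
    return f"Could you tell me: {q[0].lower()}{q[1:]}"

def perturb_semantic_level(q: str) -> str:
    # Toy stand-in: the real pipeline produces a meaning-preserving reformulation.
    return f"Based on this image, {q[0].lower()}{q[1:]}"

def expand_question(sample: dict) -> list[dict]:
    """Return the original QA pair plus one variant per perturbation level."""
    levels = [("original", lambda q: q),
              ("word", perturb_word_level),
              ("sentence", perturb_sentence_level),
              ("semantic", perturb_semantic_level)]
    return [{"image": sample["image"],
             "question": fn(sample["question"]),
             "answer": sample["answer"],
             "level": level}
            for level, fn in levels]
```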
Q1.3: This is a very insightful suggestion! We will discuss these in future work.
Q2: Our evaluation datasets follow those used in LLaVA-Med. Our goal is to explore a more robust SFT and an improved VQA evaluation protocol. Therefore, the current version of our experiments only includes benchmarks used in LLaVA-Med.
Q3: We completely agree that RL may be a promising direction for enhancing generalization! As demonstrated by DeepSeek-R1, R1-style methods exhibit impressive generalization capabilities. However, in this paper, our focus is on improving the generalization ability of SFT, which remains a widely used post-training approach. Moreover, RL can sometimes underperform SFT on in-domain tasks. Notably, our method achieves both improved in-domain performance and better generalization compared to standard SFT. Thank you again for your constructive suggestions!
To Reviewer #2: We sincerely appreciate your valuable comments and recognition of our work!
Q1: This is a great point! We fully agree that addressing potential internal biases is crucial. As shown in Figure 3, we introduced a checking module after the multi-agent framework, which employs a top-performing model (GPT-4o) to correct potential errors during the perturbation generation process. Since the perturbations primarily involve textual modifications, the accuracy and reliability of this process are relatively high. We will emphasize this aspect in future versions and provide more details on sampling parameters to ensure reproducibility. Thank you again for your constructive suggestions!
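To make the described checking step concrete, here is a hedged sketch of how a strong checker model such as GPT-4o could be used to verify that a perturbation preserves the original question's meaning; ask_checker is a hypothetical stand-in for an actual API call, and the prompt format is an assumption, not the paper's checking module.

```python
# Hypothetical sketch of a checking module that asks a strong LLM whether
# a generated perturbation preserves the original question's meaning.
# `ask_checker` stands in for an actual GPT-4o API call (assumed interface).

def ask_checker(prompt: str) -> str:
    """Placeholder: send `prompt` to a checker model (e.g., GPT-4o) and
    return its text reply. Wire this to a real API client as needed."""
    raise NotImplementedError

def is_meaning_preserved(original: str, perturbed: str) -> bool:
    prompt = (
        "You are verifying a medical VQA dataset.\n"
        f"Original question: {original}\n"
        f"Rewritten question: {perturbed}\n"
        "Do both questions ask for exactly the same clinical information? "
        "Answer YES or NO."
    )
    return ask_checker(prompt).strip().upper().startswith("YES")

def filter_perturbations(original: str, candidates: list[str]) -> list[str]:
    """Keep only candidates the checker judges semantically equivalent."""
    return [c for c in candidates if is_meaning_preserved(original, c)]
```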
To Reviewer #3: We deeply appreciate your constructive feedback and acknowledgment!
Q1: We fully agree that comparing with general-domain approaches is very important. However, works such as CLEVR-Hans and VQA-CP were developed prior to the LLM era, and integrating them into current LLM training pipelines is challenging, as they often require additional trainable modules, which limits their practicality. In contrast, our approach is fully compatible with existing LLM training frameworks: only an auxiliary training objective is added, without altering the model architecture. We will further explore comparisons with related work in the context of LLMs in future work. Thank you again for your valuable insights!
Q2: This is indeed an important point. Investigating the effect of each perturbation subset on experimental results would provide further insights. We will discuss these in future work. Thank you once again for your constructive suggestions!
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A