Abstract

In Medical Visual Question Answering (Med-VQA), accurate interpretation of clinical questions alongside medical images is crucial for reliable diagnostic support. However, conventional methods often exhibit pronounced medical language biases that stem from imbalanced data distribution and question shortcut dependence, causing models to disproportionately rely on textual priors at the expense of valuable visual semantics. To mitigate this challenge, we propose a novel Med-VQA debiasing approach called “Med-BiasX” that synergistically combines two strategies, i.e., Energy-aware Confidence Constraint (ECC) and Distribution-aware Dependence Calibration (DDC). Specifically, ECC aims to reinforce correct answers and adjust the energy associated with incorrect answers by leveraging the global normalization property of free energy and the intrinsic properties of energy. DDC is designed to shift the model’s dependency from question shortcuts to multimodal information by explicitly measuring the similarity between predicted distributions from different branches and prior distributions. Extensive experiments on multiple medical standard benchmarks and bias-sensitive benchmarks, SLAKE-BIAS and VQA-RAD-BIAS, consistently demonstrate the robustness and superiority of our Med-BiasX approach over state-of-the-art competitors.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5135_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhuHua_MedBiasX_MICCAI2025,
        author = { Zhu, Huanjia and Liu, Yishu and Zhou, Chengju and Lu, Guangming and Chen, Bingzhi},
        title = { { Med-BiasX: Robust Medical Visual Question Answering with Language Biases } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work, the authors propose a new debiasing method for medical visual question answering, which includes two strategies: 1) Energy-Aware Confidence Constraint (ECC), and 2) Distribution-Aware Dependence Calibration (DDC). Aimed at reducing bias caused by imbalanced data distribution and question shortcut dependence, ECC contributes by penalizing question shortcuts, while DDC encourages the model to focus on multimodal features instead of relying solely on the question. The proposed method is evaluated on two medical VQA datasets and their variations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The work stems from a solid motivation: VQA systems are sometimes biased due to imbalanced data and question shortcuts, which deserves more attention from the community.
    2. The structure of the paper is clear and easy to follow.
    3. The experiments are thorough, including comparisons between the proposed model and state-of-the-art models, as well as an ablation study.
    4. The manipulation of the dataset aligns with the overall logic and could be useful for research with similar objectives.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Major weaknesses:
    1. The proposed method does not seem to address all the issues it aims to tackle. As mentioned in the introduction, medical VQA systems can be biased due to a) imbalanced data distribution and b) question shortcut dependence. However, it appears that both ECC and DDC are designed specifically for the second issue, question shortcut dependence.
    2. The motivation for using these techniques is not clearly conveyed. For example, the logic behind ECC is that the result of using both modalities should be significantly different from using only the question, in order to avoid question-dominated decision-making. However, this issue has been widely discussed in multi-modal tasks, and many solutions have already been proposed. It would be helpful to explain more clearly why the energy-based method is chosen in this case, and how it performs better than alternative approaches.
    3. In Section 2.4, the loss term includes only ECC and DDC losses. What about the loss for the VQA task itself (which I assume is a classifier loss)? Is that integrated into the ECC and DDC losses? Since you have a baseline model without ECC and DDC in Table 3, there must be a separate loss term for the VQA component; this should be clarified.

    Minor weaknesses:
    * In the second paragraph of the introduction, the sentence “Theoretically, medical language biases stems from imbalanced data distribution and question shortcut dependence [25, 5, 10, 13, 19].” would be clearer if the citations were separated into these two categories rather than listed all together.
    * The x-axis label is missing in Fig. 4.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As mentioned in the weaknesses section, I found it difficult to connect the authors’ motivation with the methods they proposed. Both methods seem to target the shortcut learning issue rather than the data imbalance problem. Additionally, if the biased results stem from the dominance of one modality—which is a common issue in multi-modal tasks—this work should also be compared with other techniques that address imbalance between modalities. These are my main concerns regarding this paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ reply addressed my concerns, and I would like to change my recommendation to acceptance. According to their response, I realize there was a misunderstanding on my part regarding “imbalanced data distribution” — specifically, it refers to the types of questions (e.g., how, what, etc.) and their corresponding answers in this work. Since data imbalance can take many forms, I suggest the authors clarify this in the final version to avoid potential confusion. Additionally, please modify the learning objective function (loss) to include the base loss term as well.
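One way to read the request above is as an objective that adds the task loss to the two debiasing terms. The following is only a sketch under stated assumptions: the weights $\lambda_{1}$ and $\lambda_{2}$ are hypothetical placeholders, not values from the paper, and the final version of the paper may combine the terms differently.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{base}}
  + \lambda_{1}\,\mathcal{L}_{\text{ECC}}
  + \lambda_{2}\,\mathcal{L}_{\text{DDC}}
```

Here $\mathcal{L}_{\text{base}}$ stands for the VQA task loss, which the author feedback identifies as the RMLVQA loss.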



Review #2

  • Please describe the contribution of the paper

    The main contribution of the paper is the development of a novel debiasing approach for Medical Visual Question Answering (Med-VQA) called “Med-BiasX.” This method addresses the issues of medical language biases that arise from imbalanced data distribution and question shortcut dependence. Med-BiasX synergistically combines two strategies: 1) Energy-aware Confidence Constraint (ECC) - This mechanism reinforces correct answers and adjusts the energy associated with incorrect ones, promoting multimodal learning. 2) Distribution-aware Dependence Calibration (DDC) - This strategy recalibrates the model’s dependency on question shortcuts by measuring the similarity between predicted distributions from different branches and prior distributions. The effectiveness of Med-BiasX is validated through extensive experiments on standard benchmarks and newly constructed bias-sensitive datasets, demonstrating its robustness and superiority over state-of-the-art methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed Med-BiasX approach introduces a unique integration of Energy-aware Confidence Constraint (ECC) and Distribution-aware Dependence Calibration (DDC). This combination is innovative as it effectively addresses the dual challenges of imbalanced data and question shortcut dependence in Med-VQA, which are critical issues in medical diagnostics. The paper constructs two new bias-sensitive datasets, SLAKE-BIAS and VQA-RAD-BIAS, specifically designed to evaluate the debiasing performance of Med-BiasX. This novel dataset creation allows for a more rigorous assessment of model biases and enhances the generalizability of findings. Extensive experiments are conducted on both established medical benchmarks and the newly created bias-sensitive datasets. It provides comprehensive comparisons with state-of-the-art methods, showcasing significant performance improvements across various scenarios, which underscores the robustness of the proposed approach.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the integration of ECC and DDC is valuable, the individual components themselves are not entirely novel. Similar concepts have been explored in other contexts, for example: “A symmetric KL divergence based spatiogram similarity measure.” In: 2011 18th IEEE International Conference on Image Processing.
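For readers unfamiliar with the measure named in the cited title, a symmetric KL divergence between two discrete distributions can be sketched as follows. This is a generic illustration, not the formulation used in the cited ICIP paper or in Med-BiasX; the function names, the `eps` smoothing, and the example distributions are all assumptions made for the example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # eps guards against log(0); it is an illustrative smoothing choice.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)


# Hypothetical answer distributions over three candidate answers.
question_only = [0.7, 0.2, 0.1]  # e.g. a question-only branch
fused = [0.2, 0.3, 0.5]          # e.g. a fused image+question branch

print(symmetric_kl(question_only, fused))  # positive: the distributions differ
print(symmetric_kl(fused, fused))          # zero: identical distributions
```

The sum of the two directions is one common convention; some definitions halve it, which changes only the scale.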

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach to addressing medical language biases in Visual Question Answering, demonstrating significant improvements over state-of-the-art methods. The methodology is well-articulated, and the experiments are comprehensive, showcasing the robustness of the proposed Med-BiasX approach. But there are concerns regarding the novelty of the mechanisms, the lack of real-world clinical validation, and the absence of accessible code and datasets, which limit reproducibility. Additionally, some sections could benefit from clearer explanations to enhance understanding.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    None



Review #3

  • Please describe the contribution of the paper

    The paper introduces Med-BiasX, a debiasing framework for Medical VQA that mitigates overreliance on language priors. It integrates Energy-aware Confidence Constraint (ECC) and Distribution-aware Dependence Calibration (DDC) to suppress shortcut learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper introduces two complementary mechanisms—Energy-aware Confidence Constraint (ECC) and Distribution-aware Dependence Calibration (DDC) to suppress language biases in Med-VQA.

    2) The authors reconstruct SLAKE and VQA-RAD into SLAKE-BIAS and VQA-RAD-BIAS using controlled OOD sampling.

    3) The method is validated across both standard and newly created bias-sensitive datasets, using extensive ablation studies and comparison with state-of-the-art baselines.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) There appears to be an inconsistency in the manuscript regarding the baseline model: the Implementation Details section refers to UpDn [2] as the baseline, whereas the Experiments section cites RMLVQA [3] instead. Please clarify which architecture serves as the actual baseline.

    2) There is a substantial overlap in the theoretical concepts between your manuscript and prior work such as RUBi. It would be helpful to include a brief paragraph explicitly outlining the key differences, so that readers can clearly understand your unique contributions.

    3) The Implementation Details section, particularly the description of the baseline architecture, is quite limited. Expanding this section with more details about the architecture could be helpful. Additionally, it would be valuable for the authors to clarify whether their approach is model-agnostic, and if so, to what extent it can be applied across different architectures.

    4) It is highly recommended to include the base code, and access to the 2 curated datasets.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1) The conceptual overlap with RUBi.

    2) The unavailability of the code and the curated biased datasets.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns were addressed




Author Feedback

R1 Q2, R3 Q4: Accessible code and datasets. Response: We have already prepared a GitHub repository to share our code publicly, but the rebuttal guidelines prevent us from providing external links. We will therefore make the repository accessible immediately following the acceptance notification.

R2 Q3, R3 Q1: Clarification of the VQA loss and baseline. Response: We use the UpDn architecture as a baseline and the RMLVQA loss as the base loss.

R1 Q1: Novelty of components. Response: Compared to the ICIP11 paper “A symmetric KL divergence based spatiogram similarity measure”, which employs symmetric KL divergence as a measure of spatiogram similarity, we innovatively utilize the KL divergence to capture shortcut bias, which has not been explored.

R2 Q1: Connection of motivation and methods. Response: As discussed in the 2nd paragraph of the introduction, imbalanced answer distributions give rise to shortcut dependencies, as models learn to predict frequent answers from question templates alone rather than integrating visual evidence. Although one could directly rebalance data or craft class-specific strategies, such approaches hinge on precise knowledge of the training distribution and quickly degrade when underlying statistics are noisy or when the data shifts in deployment. Instead, we focus on eliminating shortcut pathways during optimization to ensure the model must attend to image content, which both neutralizes the downstream impact of class imbalance and yields a more generalizable system that does not rely on brittle distributional assumptions.

R2 Q2: Motivation for techniques. Response: In unbalanced multimodal learning, various strategies are proposed to balance the optimization of each modality. For example, gradient modulation strategies reweight unimodal gradients by estimating the performance difference between modalities. However, these methods target only weaker modalities, intentionally compromise the training of well-learned modalities, or introduce additional neural modules, which complicates the training procedure. In contrast to these methods, we mitigate the modality imbalance problem from the energy and distribution perspectives based on the intrinsic properties of fusion features, ensuring that the model remains robust when trained under unbalanced data distributions (which is an essential requirement for robust Med-VQA), thus mitigating medical language bias. Take the ECC mechanism as an example. Energy naturally inversely correlates with model confidence: lower energy values indicate higher certainty. Therefore, we employ our ECC as a real-time metric for shortcut bias: when ECC is high despite low image support, it flags undue reliance on question-only patterns.

R3 Q2: Our unique contributions. Response: While RUBi uses question-only branch outputs as masks to down-weight biased predictions and up-weight informative ones, our approach goes further by detecting and quantifying bias through energy-aware calibration and distribution-aware measures, then explicitly removing spurious correlations rather than merely attenuating their gradients.

R3 Q3: Architecture Expansion. Response: We employ UpDn, a popular VQA architecture, as our baseline. We implemented our Med-BiasX model in PyTorch on a single RTX 3090 GPU and used the AdamW optimizer with a weight decay of 0.001. The batch size B is set to 64, the learning rate to 0.002, and the margin hyperparameter m to 1.0. Additionally, our approach is model-agnostic. We further evaluate the generalizability of Med-BiasX across additional architectures, including SAN, S-MRL, LXMERT, and SAN+MEVF. The Med-BiasX approach consistently outperforms the corresponding baselines, achieving gains of 3.01%, 2.66%, 3.40%, and 3.78% in overall accuracy on SLAKE-BIAS, demonstrating strong adaptability and model-agnostic performance across diverse network designs and highlighting its effectiveness in enhancing a broad range of model families.
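The statement in the response to R2 Q2 that lower energy indicates higher certainty can be illustrated with the standard free-energy form from energy-based models, E = -T · log Σ exp(z_i / T). The sketch below is a generic illustration and not the ECC loss itself; the function name, the `temperature` parameter, and the example logits are assumptions made for the example.

```python
import math


def free_energy(logits, temperature=1.0):
    """Free energy E = -T * log(sum_i exp(logit_i / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    # Subtract the maximum before exponentiating to avoid overflow.
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


# Hypothetical logits over three candidate answers.
confident = [8.0, 0.5, 0.2]  # one answer clearly dominates
uncertain = [1.0, 0.9, 1.1]  # no answer stands out

print(free_energy(confident))  # lower (more negative) energy
print(free_energy(uncertain))  # higher energy
```

Because the log-sum-exp is dominated by the largest logit, a strongly peaked prediction yields a lower energy than a flat one, which matches the inverse relation described in the response.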




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


