Abstract
Biomedical visual question answering (VQA) has been widely studied and has demonstrated significant application value and potential in fields such as assistive medical diagnosis. Despite their success, current biomedical VQA models perform multimodal information interaction only at the model level within large language models (LLMs), leading to suboptimal multimodal semantic alignment on complex tasks. To address this issue, we propose BioD2C, a novel dual-level semantic consistency constraint framework for biomedical VQA, which achieves semantic interaction and alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. Specifically, we first integrate textual features into visual features via an image-text fusion mechanism as feature-level semantic interaction, obtaining visual features conditioned on the given text; we then introduce a text-queue-based cross-modal soft semantic loss function to further align the image and question semantics.
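As an illustration of the feature-level interaction described in the abstract, a minimal PyTorch-style sketch of text-conditioned gated fusion might look as follows. The module structure, dimensions, and the use of cross-attention with a sigmoid gate are assumptions for illustration only, not the authors' released implementation:

```python
# Minimal sketch of feature-level image-text fusion with a gating mechanism.
# Illustrative approximation under stated assumptions, not the authors' code.
import torch
import torch.nn as nn

class GatedImageTextFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Cross-attention: visual tokens attend to question tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate deciding how much textual context to inject per visual token.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual_feats, text_feats):
        # visual_feats: (B, Nv, dim), text_feats: (B, Nt, dim)
        attended, _ = self.cross_attn(query=visual_feats,
                                      key=text_feats,
                                      value=text_feats)
        g = self.gate(torch.cat([visual_feats, attended], dim=-1))
        # Text-conditioned visual features: blend original and attended features.
        return g * attended + (1 - g) * visual_feats
```

In a design of this kind, the gate lets each visual token decide how much question context to absorb, which matches the abstract's description of adaptively learning visual features based on the question.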
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0768_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/jzy-123/BioD2C
Link to the Dataset(s)
VQA-RAD dataset: https://osf.io/89kps/
SLAKE dataset: https://www.med-vqa.com/slake/
Path-VQA dataset: https://github.com/UCSD-AI4H/PathVQA
BioVGQ dataset: https://huggingface.co/datasets/jzyang/BioVGQ
BibTex
@InProceedings{JiZhe_BioD2C_MICCAI2025,
author = { Ji, Zhengyang and Gao, Shang and Liu, Li and Jia, Yifan and Yue, Yutao},
title = { { BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15969},
month = {September},
pages = {86--96}
}
Reviews
Review #1
- Please describe the contribution of the paper
- The paper conducts data cleaning on existing datasets and generates a new dataset using GPT.
- The paper proposes the BioD2C multi-modal feature fusion framework, achieving comparable results on multiple existing datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors demonstrate strong writing skills, presenting the work in a clear and fluent manner.
- The authors performed data cleaning on the PMC-VQA dataset and generated additional QA data using GPT.
- The authors propose an innovative multimodal feature alignment method (BioD2C) for medical models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Questionable MFE Module Design: The MFE module resembles FPN's multi-scale fusion but uses only the Vision Encoder's final layer with varying pooling kernels. Simple pooling cannot generate meaningful features, making this module appear ineffective; even adding an MLP would be better, though still suboptimal (an illustrative sketch of this kind of design follows this list).
- Unclear BioVGQ Dataset Criteria: The classification model for “clean/polluted” samples lacks definitions, methodology details, and justification, undermining the dataset’s credibility.
- No QA Sample Validation: The ChatGPT-generated QA samples lack cleaning procedures to filter unrealistic/polluted samples, raising quality concerns.
- Marginal Improvements Despite More Data: Results show limited gains over baselines despite using more training data (including BioVGQ), failing to demonstrate data efficiency.
- Narrow BioVGQ Utility: The dataset was only used for BioD2C without testing other models, severely limiting its demonstrated research value.
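For reference, the kind of multi-scale pooling design critiqued in the first weakness above could look like the following minimal sketch; the kernel sizes, pooling type, and projection layer are assumptions for illustration, not the paper's actual MFE code:

```python
# Illustrative sketch of multi-scale feature extraction by pooling the
# final-layer vision-encoder patch tokens with different kernel sizes.
# Kernel sizes and the output projection are assumptions, not the paper's code.
import torch
import torch.nn as nn

class MultiScalePoolingMFE(nn.Module):
    def __init__(self, dim=768, kernel_sizes=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.AvgPool2d(kernel_size=k, stride=k) for k in kernel_sizes]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_feats, grid_size):
        # patch_feats: (B, N, dim) final-layer patch tokens; grid_size: (H, W) with H * W == N
        B, N, D = patch_feats.shape
        H, W = grid_size
        fmap = patch_feats.transpose(1, 2).reshape(B, D, H, W)
        scales = []
        for pool in self.pools:
            pooled = pool(fmap)                               # (B, D, H/k, W/k)
            scales.append(pooled.flatten(2).transpose(1, 2))  # (B, Nk, D)
        # Concatenate tokens from all scales and project to a shared space.
        return self.proj(torch.cat(scales, dim=1))
```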
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While this paper presents some interesting ideas (e.g., BioD2C framework, GPT-augmented dataset), several fundamental limitations significantly undermine its contribution:
- Technical Flaws in Methodology: (1) The proposed MFE module lacks sound design principles, with questionable effectiveness in feature extraction through simple pooling operations. (2) No validation was provided for the GPT-generated QA data, risking noise propagation.
- Insufficient Empirical Validation: (1) Marginal performance gains over baselines, despite using more training data, suggest limited methodological advancement. (2) The BioVGQ dataset’s utility remains unproven, as it was only tested on the authors’ model.
- Reproducibility & Rigor Concerns: (1) Critical details (e.g., “clean/polluted” sample definitions, QA data cleaning) are missing, hindering reproducibility. (2) The novelty of BioD2C is not clearly differentiated from prior multi-modal fusion approaches.
These issues collectively limit the paper’s technical soundness and potential impact, so I rated it as a weak reject. With major revisions (e.g., rigorous ablation studies, dataset validation, and clearer novelty analysis), this work could be reconsidered.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors addressed most of my concerns, so I can accept the paper. However, I remain conservative since the experimental results were not provided in the rebuttal.
Review #2
- Please describe the contribution of the paper
The paper introduces BioD2C, a novel framework for biomedical visual question answering (VQA) that improves multimodal semantic alignment through a dual-level interaction strategy. Unlike prior models that rely solely on large language models (LLMs) for image-text fusion at the model level, BioD2C integrates feature-level fusion via a Transformer-based image-text interaction module and a gating mechanism to produce text-conditioned visual features. Additionally, the framework introduces a text-queue-based cross-modal semantic loss that aligns the distributions of visual and textual features, further enhancing semantic consistency. To support effective training, the authors construct BioVGQ, a new biomedical VQA dataset with 81K high-quality images and 188K question-answer pairs, designed to mitigate the noise and mismatch issues in existing datasets. Experimental results on multiple benchmarks (SLAKE, Path-VQA, VQA-RAD) demonstrate that BioD2C outperforms state-of-the-art biomedical VQA models, highlighting its effectiveness and generalizability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel Dual-Level Semantic Alignment Strategy: The proposed BioD2C framework introduces a unique combination of feature-level image-text fusion and model-level interaction within an LLM, addressing a key limitation of existing biomedical VQA methods, which rely solely on model-level fusion.
- Innovative Cross-Modal Semantic Loss: The paper proposes a text-queue-based semantic loss, inspired by contrastive learning and MoCo-like mechanisms, to better align text and image feature distributions. This is a new loss design for VQA, enhancing multimodal consistency (an illustrative sketch of this kind of objective follows this list).
- Strong Empirical Results Across Benchmarks: BioD2C outperforms several strong biomedical VQA models (e.g., BiMediX2-8B, RadFM, LLaVA-Med) across a range of tasks and metrics (ACC, BLEU-1, ROUGE-1), showing generalizability and robustness.
- Thorough Ablation and Visualization Studies: The ablation experiments convincingly demonstrate the importance of each component (semantic loss, fusion mechanism, dataset choice). Visualization of attention maps supports the claim that the model dynamically attends to relevant regions in the image based on the question.
- Good Clinical Relevance and Framing: While still in a research setting, the paper is well motivated from a clinical perspective, with emphasis on image-grounded answering (as would be needed in decision-support scenarios).
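For context on the loss described above, one way a text-queue-based soft alignment objective could be realized is sketched below. The use of a KL divergence between image-to-queue and text-to-queue similarity distributions, the temperature, and the queue handling are assumptions inferred from the review's description, not the authors' exact formulation:

```python
# Hedged sketch of a text-queue-based soft semantic alignment loss.
# A queue stores recent text features (MoCo-style); visual and text features
# are each compared against the queue, and their similarity distributions are
# pulled together with a KL divergence. Details here are illustrative assumptions.
import torch
import torch.nn.functional as F

def text_queue_soft_loss(visual_feat, text_feat, text_queue, temperature=0.07):
    # visual_feat, text_feat: (B, D) pooled features; text_queue: (K, D)
    v = F.normalize(visual_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    q = F.normalize(text_queue, dim=-1)
    sim_v = v @ q.t() / temperature      # (B, K) image-to-queue similarities
    sim_t = t @ q.t() / temperature      # (B, K) text-to-queue similarities
    # Soft targets from the text side; push the visual distribution toward them.
    target = F.softmax(sim_t, dim=-1).detach()
    return F.kl_div(F.log_softmax(sim_v, dim=-1), target, reduction="batchmean")
```

Under this kind of objective, the visual feature is not forced to match its paired text directly; instead its similarity profile over recent text features is pulled toward the text side, which is one way to avoid collapsing visual structure onto the text space.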
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited Description of Dataset Generation Process: While the paper introduces the BioVGQ dataset, the description of how ChatGPT-generated question-answer pairs are validated or quality-controlled is lacking. There is no indication of manual validation, annotation guidelines, or spot-checking for medical correctness. ChatGPT is known to hallucinate or oversimplify biomedical reasoning, especially when image context is only partially available. This poses serious risks of introducing biased, misleading, or clinically incorrect content into the training set. Additionally, the paper uses a classifier trained on just 3,000 manually labeled images to filter “clean” vs. “polluted” images out of 77K+ from PMC. There is no detailed validation or inter-rater agreement reported for this filtering. This raises concerns about false positives or negatives, and whether “clean” images are truly representative of clinical data.
- Text-Queue Loss Design Lacks Comparative Baselines: The proposed semantic loss function is interesting, but the paper doesn’t compare it to alternative alignment strategies (e.g., contrastive loss, triplet loss, CLIP-style cosine similarity). It’s unclear if the observed gains are due to the specific design or simply due to the presence of an auxiliary self-supervised alignment signal that reinforces multimodal consistency.
- Limited Discussion on Model Complexity and Inference Time: While the paper discusses training efficiency (via LoRA, Deepspeed, etc.), it doesn’t report inference time, memory usage, or model size in detail. This information is key for evaluating clinical scalability.
- Some Clarity Gaps in Method Description: Mathematical notations in the fusion module and semantic loss are somewhat dense and could be better explained. There is minimal discussion of why the multi-scale fusion is effective beyond empirical performance.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper presents a novel dual-level semantic alignment framework for biomedical visual question answering (BioD2C), combining feature-level image-text fusion with a cross-modal semantic loss to improve alignment between questions and visual content. The method is well motivated, and the proposed architecture shows clear improvements over strong baselines across several public biomedical VQA datasets. The addition of a new dataset, BioVGQ, further contributes to the field. However, the data generation process relies heavily on ChatGPT-generated Q&A pairs and image filtering via a weakly validated classifier, without sufficient human or clinical expert verification, raising concerns about the reliability and clinical validity of the dataset. Furthermore, while the semantic loss is a key component, the paper does not compare it to standard self-supervised alignment strategies, making it unclear whether the improvements stem from the specific design or simply from additional training signals. Despite these concerns, the method is novel, the results are promising, and the core ideas are relevant to the MICCAI community.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I recommend acceptance based on the strength and novelty of the proposed method (BioD2C), which introduces a dual-level semantic alignment strategy and demonstrates clear empirical improvements over strong baselines. However, I leave it to the discretion of the area chairs to assess whether the dataset construction and validation process meets the standards expected for MICCAI and, consequently, whether the paper should be accepted. While the BioVGQ dataset is a promising contribution, it lacks essential validation details: no precision or recall metrics are reported for the image filtering classifier; no statistics are provided on the frequency or nature of corrections during the ChatGPT-based “re-feeding” process; and there is no clear explanation or quantification of how captions and ChatGPT outputs were “cross-validated”. Additionally, the term “crafted prompts” is vague and not clearly defined. These issues do not diminish the methodological contribution of the paper, but they do limit the reliability and reusability of the dataset without further exploration and clarification.
Review #3
- Please describe the contribution of the paper
1. This paper introduces a multimodal dataset, BioVGQ, which contains 81,000 medical images and 188,000 question-answer pairs.
2. The authors propose a novel BioD2C algorithm, which achieves state-of-the-art (SOTA) performance on biomedical VQA benchmarks, including both their own dataset and other datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Novel Dataset (BioVGQ): The authors propose BioVGQ, a large multimodal biomedical dataset (81K images, 188K QA pairs), addressing limitations of prior datasets through better image-question alignment.
2. Innovative BioD2C Framework: The dual-level semantic alignment approach, especially the feature-level fusion mechanism and text-queue-based semantic loss, effectively enhances multimodal semantic consistency.
3. State-of-the-art Results: BioD2C achieves state-of-the-art performance across multiple biomedical VQA benchmarks, validating its effectiveness and generalizability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Lack of details: The authors mention using a small number of manually annotated samples to train a classifier for filtering images. However, the specific filtering criteria, thresholds, and error rates are unclear. Additionally, the strategy for selecting GPT-4o responses is not described clearly.
2. Lack of hyperparameter K ablation studies: The paper does not provide an ablation study on the hyperparameter K. Including an analysis of how different values of K affect model performance, along with additional visualizations, would be beneficial.
3. Suggestion for visualization: It would be helpful to include figures or tables to clearly illustrate the differences between the proposed dataset and existing multimodal medical datasets.
4. Anonymity concerns: The dataset and code do not seem to be anonymized. I’m concerned that publicly sharing the GitHub link during the double-blind review process could potentially log reviewer identities upon clicking.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The experimental results appear reliable, the method achieves state-of-the-art performance, and the dataset is large.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
Thanks to the reviewers for your valuable comments on our work. We address the concerns as follows:

Details of BioVGQ Dataset Construction (R1: Q2/Q3, R2: Q1, R3: Q1): Space constraints kept us from detailing BioVGQ construction, so we clarify here. We label an image as “polluted” if it (1) has more than six sub-figures, which dilutes useful information, or (2) contains clearly man-made content such as tables or hand sketches; all others are “clean.” Because polluted and clean images differ visibly, a PMC-CLIP-based classifier fine-tuned on 3K annotated samples is sufficient for reliable separation. For ChatGPT-generated QA pairs we ensure quality by (1) re-feeding each pair to ChatGPT for self-check and correction, (2) jointly inputting the image and caption so they cross-validate, and (3) using carefully crafted prompts. These dataset-creation details will be included in the revised version.

Marginal Improvements Despite More Data? (R1: Q4): As shown in Table 1 of our paper, BioD2C achieves superior performance over existing SOTA models on nearly every benchmark, with an average score of 0.641 compared to 0.616 for the previous best. Importantly, this improvement is obtained using our BioVGQ dataset, which contains only 81K images and 188K QA pairs, substantially smaller than the datasets used by baselines (e.g., PMC-VQA’s 149K images and 227K QA pairs). These results demonstrate both the effectiveness of BioD2C and the data efficiency of BioVGQ.

Design and Effectiveness of the MFE Module (R1: Q1, R2: Q4): We believe that multi-level pooling effectively captures features at different granularity levels, a capability validated by prior work. Although it may not be the optimal approach, the subsequent image-text fusion module further refines the features extracted by the MFE, ensuring overall effectiveness. In addition, we have added an ablation study on the MFE in the revised version to demonstrate its contribution.

Uniqueness and Effectiveness of the Text-Queue Loss (R2: Q3): Other alignment strategies cannot be directly applied to our VQA setting for the following reasons: (1) it is hard to define positive/negative pairs in VQA; (2) our goal is to adapt visual features toward text semantics, not to enforce mutual similarity; (3) direct cosine optimization can make visual features overly similar to text, losing unique visual structure. Moreover, our ablation study demonstrates the effectiveness of the text-queue loss.

Selection of the Text-Queue Length k (R3: Q2): The selection of the queue length k was guided by prior work. We set k to 30 as a balanced choice that offers sufficient semantic coverage without compromising quality (an illustrative sketch of such a fixed-length queue follows this feedback).

Training and Inference Details of BioD2C (R2: Q3): Details regarding the model size, memory usage, and training configurations of BioD2C are available in the accompanying codebase and model repository.

More Test Results of Models on BioVGQ (R1: Q5): Thanks for the good suggestion. BioVGQ is designed primarily as a high-quality training dataset rather than a benchmark, and our current results demonstrate its pivotal role in model training. Following your suggestion, we will include further experiments and analyses addressing this point.

Clearer Presentation of Method Details (R2: Q4): We will revise Section 3 to improve readability and simplify complex notations.

Visualization Suggestion (R3: Q3): We will include a visualization comparison between BioVGQ and other datasets in the revised version.

Anonymity Concerns (R3: Q4): Thanks for raising this concern. We assure the reviewers that the open-source links provided in the paper do not reveal any information that could identify visitors.
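To make the queue mechanism mentioned in the feedback concrete, a minimal sketch of a fixed-length text-feature queue is given below. The queue length k = 30 follows the value stated in the feedback, while the FIFO implementation details are illustrative assumptions rather than the authors' code:

```python
# Minimal sketch of a fixed-length text-feature queue (MoCo-style FIFO).
# Queue length k = 30 follows the value reported in the author feedback;
# everything else here is an illustrative assumption.
import torch

class TextFeatureQueue:
    def __init__(self, dim=768, k=30):
        self.k = k
        self.queue = torch.zeros(k, dim)   # oldest entries get overwritten first
        self.ptr = 0
        self.filled = 0

    @torch.no_grad()
    def enqueue(self, text_feats):
        # text_feats: (B, dim) text features from the current batch
        for feat in text_feats:
            self.queue[self.ptr] = feat
            self.ptr = (self.ptr + 1) % self.k
            self.filled = min(self.filled + 1, self.k)

    def get(self):
        # Return only the populated portion of the queue.
        return self.queue[: self.filled]
```

In a setup like this, the queue would be refreshed each training step with the latest batch of text features and read when computing the alignment loss.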
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Three reviewers recommended acceptance after rebuttal, citing strong methodological contributions, empirical performance, and overall clarity.
Reviewer #1 acknowledged the novelty of the BioD2C framework and the construction of the BioVGQ dataset. While initially noting concerns about the MFE module design, lack of validation for GPT-generated QA pairs, and marginal performance gains, the reviewer found the rebuttal addressed most issues and ultimately supported acceptance, albeit cautiously due to the absence of additional experimental results.
Reviewer #2 emphasized the dual-level semantic alignment strategy and the introduction of a text-queue-based semantic loss as key innovations. The model showed consistent improvements across multiple biomedical VQA benchmarks. While concerns remained about the dataset’s generation and lack of comparison for the proposed loss function, the reviewer endorsed acceptance based on the method’s novelty, empirical strength, and clinical relevance.
Reviewer #3 highlighted the scale and quality of the BioVGQ dataset and the strong performance of BioD2C across benchmarks. Suggestions included more detail on dataset filtering, hyperparameter ablations, and improved visualizations. Despite minor concerns, the reviewer supported acceptance, citing the work’s reliability, contribution, and potential impact.
Overall, the paper was accepted based on its novel framework, new dataset, and competitive results, with remaining concerns viewed as addressable in future work.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A