Abstract

Medical Visual Question Answering (VQA) enables large language models to answer questions about clinical images. While domain-specific LLMs are capable of strong reasoning, their development can be costly. In contrast, general-purpose models are more efficient but often lack deep domain understanding. Previous research has shown that integrating external knowledge enhances the performance of general-purpose LLMs, particularly for questions involving complex medical terminology. To improve the utilization of external knowledge, we introduce a novel multimodal knowledge space pretraining method trained with the proposed Balanced Multimodal Contrastive Learning Loss. Our approach optimizes knowledge spaces through balanced contrastive learning across modalities, together with an auxiliary classification task. Additionally, we develop a novel framework that improves knowledge-driven Medical VQA for LLMs by integrating the pretrained knowledge space. Experiments on the Slake, VQA-RAD, and PathVQA datasets demonstrate that our approach outperforms state-of-the-art Medical VQA methods, achieving average accuracies of 85.8%, 76.7%, and 60.0%, respectively. The source code is available at https://github.com/yaziciz/BaMCo.
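To make the core idea concrete, below is a generic schematic of a class-balanced supervised contrastive objective paired with an auxiliary classification term. It follows the standard balanced-contrastive formulation and is meant purely as an illustration; it is not the paper's exact Equation 1, and the weight λ and the per-class averaging are placeholders.

    \mathcal{L} \;=\; -\sum_{i}\frac{1}{|P(i)|}\sum_{p\in P(i)}
      \log\frac{\exp(z_i \cdot z_p/\tau)}
               {\sum_{y\in\mathcal{Y}}\frac{1}{|B_y|}\sum_{a\in B_y}\exp(z_i \cdot z_a/\tau)}
      \;+\;\lambda\,\mathcal{L}_{\mathrm{CE}}

Here P(i) is the set of in-batch positives sharing anchor i's entity class, B_y collects the in-batch samples of class y, τ is a temperature, and L_CE is the auxiliary cross-entropy classification loss.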

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3305_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/yaziciz/BaMCo

Link to the Dataset(s)

The datasets used in this work are publicly available in the following repositories:
Slake: https://huggingface.co/datasets/BoKelvin/SLAKE
VQA-RAD: https://huggingface.co/datasets/flaviagiammarino/vqa-rad
PathVQA: https://huggingface.co/datasets/flaviagiammarino/path-vqa

BibTex

@InProceedings{YazZiy_BaMCo_MICCAI2025,
        author = { Yazıcı, Ziya Ata and Ekenel, Hazım Kemal},
        title = { { BaMCo: Balanced Multimodal Contrastive Learning for Knowledge-Driven Medical VQA } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {78 -- 88}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper is concerned with the task of Medical VQA. Specifically, the authors argue that domain-specific LLMs have a huge development cost, whereas general-purpose LLMs lack in-depth understanding. The authors propose a framework named BaMCo (Balanced Multimodal Contrastive Learning), which optimizes knowledge spaces through balanced contrastive learning across modality relations together with an auxiliary classification task. Additionally, they propose a method that uses intra-class image features to associate common anatomical landmarks with medical terms.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Following are some of the major strengths:

    1. The paper targets the interesting problem of developing domain-specific LLM capabilities with a limited budget and limited data.

    2. The paper develops loss functions for efficient inter-modality learning.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Weakness: Below are some of the major points for improvement:

    1. The paper is really hard to read and there is a lack of smooth flow to understand things clearly. Some sentences need to be read twice or thrice to grasp what the authors meant to say. For example, (i) in the Introduction section, 'BamCO Loss that focuses on intra-class visual features, definitions of terminology, and the relationships among the terms' - what is the terminology, which terminology, and which terms? (ii) 'By utilizing the pretrained knowledge encoder, we include additional knowledge from the question to enable the LLMs to benefit from the pretrained multimodal knowledge space, without relying on extensive databases, as shown in Fig. 1.' - This is too complex to follow on a first couple of readings. Which additional knowledge, how, and what is this knowledge encoder? I strongly suggest that the authors simplify the writing.

    2. Knowledge Source - The authors should clarify what close-ended questions are. Does it mean that the answer is either "yes" or "no"? Moreover, why only use the close-ended questions whose answer is "yes"? Sometimes a "no" answer may be appropriate when the question contains a negation; in that case, the authors are missing that sample's association of the image with the domain terms. Table 1 does not explain what the head, middle, and tail entities are, what the relations are, and how the final number in the "All" column is formed. Basically, there is little to no explanation of how the relationships were established in the Knowledge Source section, which is one of the main aspects of the paper.

    3. Intra-Class Image Sampling: The authors describe randomly sampling intra-class images to retrieve common features, but it’s unclear why random selection is used instead of more structured methods, especially for imbalanced datasets where some classes have few images. This approach might miss capturing representative features, and the connection to in-context learning (ICL) isn’t explicitly clarified, leaving ambiguity about its novelty.

    4. BaMCo pretraining - I don't see the real point of using the cross-entropy loss here. Even with a long-tailed distribution, since intra-class samples are fed in and concatenated, the contrastive loss with the textual descriptions should take care of this during learning and pretraining. Moreover, I also don't see the point of this kind of training. Why not simply employ a CLIP-based cosine similarity, given the extracted terms and entities, instead of making the pretraining stage so complicated?

    5. BaMCo pretraining - The authors concatenate multimodal embeddings with question tokens, but how the LLM effectively processes this high-dimensional, heterogeneous input to generate coherent answers is not well explained. LoRA fine-tuning is mentioned, yet there is no discussion of how it adapts the LLM to balance the influence of X, K, and V, especially given their differing modalities and potential misalignment in feature spaces.

    6. Evaluation protocol has several limitations - (i) To really test the proposed framework's performance, I believe it should have been trained on only one of the datasets and tested on another dataset to check the generalizability of the framework, rather than doing dataset-specific training and testing. As most of the domain knowledge is from the dataset itself, it is intuitive that this extra training will make the performance better. So, a better way to know whether the performance really improved would be to test on another dataset from the same domain. (ii) The BLEU and ROUGE scores are not representative of answer generation quality, especially in the medical domain, as has been noted by many benchmarking works. It would be better if the authors demonstrated the performance improvement in terms of semantic similarity. BLEU and ROUGE scores will of course improve because the dataset has been used for training the models. (iii) There is a lack of ablation studies on multiple aspects: first, excluding the intra-class images to see the impact on the framework, which the authors have not demonstrated; second, a simple CLIP-like training rather than the BaMCo loss; and third, removing the cross-entropy loss and observing the performance.

    7. There are some serious issues with language use - for example, (i) in the abstract, 'in-domain' LLMs is very confusing. Usually, in-domain refers to in-distribution versus out-of-distribution data, but here I believe the authors mean domain-specific (e.g., radiology, histopathology, etc.). It may confuse the reader as to which LLMs the authors are referring to; could you please use a more appropriate term? (ii) in the abstract, 'leverage a novel method' - if you are proposing a novel method, it is not leveraged, it is built by you, so I don't think leverage is a suitable word here. When you leverage something, it has already been built and is no longer novel. This sentence also leaves the reader unsure whether a novel method is built or an existing method is leveraged, though I presume the authors mean the former.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Please see the weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the ideas are interesting, the paper is hard to follow, which makes it difficult to get a good grasp of the proposed approach. Moreover, certain explanations regarding the loss functions and model architecture are not well motivated or explained. Additionally, it is hard to justify the experimental results based on the metrics adopted by the authors.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I appreciate the thorough rebuttal and response of the authors to my comments. Many of my doubts and comments were answered effectively and cleared. Thus, I have decided to raise my rating to a weak accept of score 4.

    The authors have demonstrated a novel architecture, and after reading their rebuttal and the paper again in detail, I could resolve many of the points that I had misunderstood earlier or that were not clear on first reading. However, the writing of the paper still needs substantial improvement, and thus it is a weak accept from my side. Please see the list below for some of the suggested changes (not an exhaustive list). I encourage the authors to go through the manuscript again in detail and revise the writing into a more lucid and easy-to-follow style.

    1. The term 'in-domain' LLMs is still confusing; please use domain-specific or something similar.
    2. Please do not write 'leverage' a novel method; write 'developed' a novel method instead.
    3. The paragraph 'To address the above limitations, in this study, we propose a multimodal knowledge space pretraining approach, optimized by the proposed Balanced Multimodal Contrastive (BaMCo) Loss that focuses on intra-class visual features, definitions of terminology, and the relationships among the terms. Using a multimodal knowledge space, we address the limitations of modality representation and leverage inter-modality connections for better optimization. Additionally, we employ a long-tailed classification task, informed by the extracted features from both images and terms, as a supplementary regularization for the loss function. By utilizing the pretrained knowledge encoder, we include additional knowledge from the question to enable the LLMs to benefit from the pretrained multimodal knowledge space, without relying on extensive databases, as shown in Fig. 1.' needs to be revised substantially: (i) Fig. 1 should be referenced at least a couple of times in the text rather than only once; (ii) the authors are not only utilising a knowledge encoder for the knowledge space but also GLIMS and a pretrained vision encoder for the vision space, so this should also be mentioned; (iii) the long-tailed sentence needs to be revised to explicitly specify that the regularization loss function is a cross-entropy loss, as there are many other ways of achieving long-tailed classification.
    4. The BERT Score should be specified as a semantic similarity score in the text, or at least a reference to the corresponding paper should be provided.
    5. Table 1 - It is not clear how the numbers of head and tail entities were obtained. The authors need to add at least one sentence explaining how they separated the head and tail entities, as well as the relations.

    Please ensure that the authors make these changes and other explanation changes as deemed necessary to improve the understandability of this important work.



Review #2

  • Please describe the contribution of the paper

    The authors introduce a method for improving the performance of generalist LLMs with domain-specific knowledge without general domain-specific pre-training. They do this by learning a knowledge encoder that uses entities and relations extracted from the question, as well as an intra-class image encoder using images whose embeddings most closely match the reference image. All of this is trained together with a contrastive loss so that the final trained components (now frozen) can be used to finetune an LLM on Medical VQA datasets. The final performance is competitive with SOTA Medical LLMs.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem of injecting generalist LLMs with medical knowledge without finetuning is essential considering the ever increasing power of generalist LLMs and how expensive it is to execute full finetuning. This allows smaller groups to access very strong medical LLMs at a fraction of the cost.
    • The proposed method seems sound and sensible, if a bit convoluted
    • The results are quite strong given that no large-scale finetuning is performed
    • Figure 3 is very strong
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Equation 1 is very poorly explained. What is B? What is B_j? What is Y? The text would greatly benefit from an intuitive explanation of what the loss is calculating.
    • Is the whole point of BaMCo only to train the GLIMS? This is the only part of Figure 2 that is left trainable. Or also the MLPs?
    • Missing ablations on the benefit of including the GLIMS intra-class image encoder branch and, more generally, on the effect of K on GLIMS performance.
    • Missing comparison to the current SOTA LLaVa-Tri, which was first posted in August 2024 and has been accepted to ICLR: https://arxiv.org/pdf/2408.02900v2. LLaVa-Tri significantly outperforms the proposed method, but that does not necessarily disqualify this paper, considering that LLaVa-Tri is domain-adapted, quite expensive to train, and not public.
    • What is “text-text matching” in Figure 4? What is the “fine-tuned BiomedCLIP” model?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology seems sound and effective. There are still some open questions concerning how exactly the architecture is trained, as well as some missing ablations and comparisons to SOTA.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    While I would appreciate additional ablations and comparisons to newer methods, I concede that this is out of the scope of a rebuttal and believe the original paper is strong enough to warrant acceptance, particularly if the authors improve the clarity for the camera-ready version.



Review #3

  • Please describe the contribution of the paper

    The authors present a framework to improve knowledge-driven VQA in the medical domain by proposing a multimodal pretraining approach that leverages a balanced contrastive loss objective (BaMCo). The framework makes use of current state-of-the-art vision-language models and builds on top of them by generating a latent multimodal knowledge space, which is finally integrated to enable better VQA performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Robust evaluation - The authors perform zero-shot and few-shot evaluations against other established frameworks (BiomedGPT-B, LLaVA-Med, etc.).

    Multiple dataset experimentation - The authors experiment with and evaluate the performance of the proposed method on three different VQA datasets.

    The authors perform thorough experimentation and ablation to demonstrate the performance improvements.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Writing clarity - The paper is written in a very vague way, and the ideas don't come across clearly (e.g., Sec. 3.2, Implementation Details).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novel methodological advance and the thorough experimentation and evaluation lead me to recommend acceptance of this paper.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed the raised concerns, and the study proposes an innovative loss framework for VLM applications in medicine. Further experiments and external validation are required to establish this framework; however, the early proof of concept looks promising.




Author Feedback

We thank the reviewers for their comments.

Our work’s novelty lies in a knowledge-driven Medical VQA method, which minimizes reliance on extensive datasets. We propose a strategy for generating multimodal knowledge sources for knowledge space pretraining using a balanced contrastive loss. This enhances the answering capabilities of visual large language models with fewer resources than simply fine-tuning on specific datasets. Our results show improvements over previous studies. Please refer to the list of contributions at the end of the Introduction section.

Use of "No" Answered Close-Ended Questions (R1): Close-ended questions yield yes/no answers. To align with the CLIP framework and simplify training, we used only "yes" answers for positive image-text pairs. Similar to [27], we defined positive pairs for contrastive learning and treated other samples in the batch as negatives when they were drawn from different entity classes. However, including "no" answers as additional negative samples could be an interesting direction for future study.
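For illustration only, the pairing rule described above can be encoded as a simple in-batch mask for a supervised contrastive loss; the function name, tensor shapes, and PyTorch usage below are a minimal sketch, not taken from the released BaMCo code.

    import torch

    def supcon_mask(labels: torch.Tensor) -> torch.Tensor:
        """Return a [B, B] mask: 1 where two in-batch samples share an entity
        class (positives), 0 otherwise (negatives); self-pairs are excluded."""
        labels = labels.view(-1, 1)
        mask = (labels == labels.T).float()
        mask.fill_diagonal_(0.0)
        return mask

    # Example: four "yes"-answered image-text pairs drawn from three entity classes.
    labels = torch.tensor([0, 1, 0, 2])
    print(supcon_mask(labels))  # samples 0 and 2 are mutual positives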

BaMCo Pretraining, Reason for Cross Entropy Loss, Comparison with CLIP-based Learning (R1): Our method builds on findings from [31], which show that contrastive learning effectively learns representations but struggles with long-tailed distributions due to high-frequency classes dominating optimization. To address this, they introduce a two-branch framework combining contrastive learning with a classification branch using CE loss for class imbalance and show the improvement. In our BaMCo framework, therefore, we retain CE loss to stabilize learning and reduce bias towards head classes. Our ablation study in Table 3 and Figure 4 indicates that the BaMCo loss significantly improves training compared to CLIP-based cosine similarity, demonstrating the method’s advantages over CLIP-based learning.
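As a rough sketch of the two-branch idea referenced here (a contrastive branch plus a CE classification branch), the snippet below combines a standard supervised contrastive term with a cross-entropy term. The balanced variant in [31] additionally averages the denominator per class; the weighting, temperature, and shapes here are illustrative assumptions rather than the paper's Equation 1.

    import torch
    import torch.nn.functional as F

    def two_branch_loss(z, logits, labels, tau=0.07, lam=1.0):
        """z: [B, D] L2-normalized embeddings, logits: [B, C] classifier outputs,
        labels: [B] entity-class ids. Returns contrastive + lam * cross-entropy."""
        sim = z @ z.T / tau                                   # pairwise similarities
        mask = (labels.view(-1, 1) == labels.view(1, -1)).float()
        mask.fill_diagonal_(0.0)                              # exclude self-pairs
        non_self = 1.0 - torch.eye(len(z), device=z.device)

        # log-probability of each pair against all other in-batch samples
        exp_sim = torch.exp(sim) * non_self
        log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)

        # average over each anchor's positives (anchors without positives contribute 0)
        pos_count = mask.sum(dim=1).clamp(min=1.0)
        contrastive = -(mask * log_prob).sum(dim=1) / pos_count

        return contrastive.mean() + lam * F.cross_entropy(logits, labels)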

Multimodal Integration in LLM (R1): We project each modality’s embedding into a shared space to manage heterogeneous inputs, as shown in Figure 2. This places similar images/texts closer together. To align the embeddings of the pretrained modality encoders for answer generation, we use MLPs or pooling perceivers in Figure 3. Inspired by LLaVA-Med [13], these embeddings are merged with question tokens, as explained in the “Knowledge-driven Medical VQA” section, and further aligned with LoRA fine-tuning.
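A minimal sketch of this integration step, assuming frozen vision and knowledge encoders whose outputs are projected by small MLPs into the LLM embedding space and prepended to the question token embeddings; the module names are placeholders, and the commented LoRA configuration uses the Hugging Face peft API purely as an illustration.

    import torch
    import torch.nn as nn

    class Projector(nn.Module):
        """Maps an encoder's output tokens into the LLM hidden size."""
        def __init__(self, in_dim, llm_dim):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(in_dim, llm_dim), nn.GELU(),
                                     nn.Linear(llm_dim, llm_dim))
        def forward(self, x):                  # x: [B, N, in_dim]
            return self.mlp(x)

    def build_llm_inputs(vision_feats, knowledge_feats, question_embeds,
                         vision_proj, knowledge_proj):
        """Concatenate projected vision and knowledge tokens with question tokens."""
        v = vision_proj(vision_feats)          # [B, Nv, D]
        k = knowledge_proj(knowledge_feats)    # [B, Nk, D]
        return torch.cat([v, k, question_embeds], dim=1)

    # Illustrative LoRA adapters on the LLM's attention projections:
    # from peft import LoraConfig, get_peft_model
    # llm = get_peft_model(llm, LoraConfig(r=16, lora_alpha=32,
    #                                      target_modules=["q_proj", "v_proj"]))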

Intra-Class Image Sampling and In-Context Learning (R1): We select images within terms, avoiding combinations of different terms; this prevents dataset imbalance from impacting learning. The manuscript specifies this as “…common features among images in the same entity class…”. Although not optimized, this method sufficed for evaluating multimodal improvements; future work may explore structured selection. We did not focus on ICL but instead employed a CLIP-based framework.
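A small sketch of random intra-class sampling as described above, assuming an index from entity class to image indices; K and the data structures are illustrative.

    import random
    from collections import defaultdict

    def build_class_index(labels):
        """labels[i] is the entity class of image i."""
        index = defaultdict(list)
        for i, c in enumerate(labels):
            index[c].append(i)
        return index

    def sample_intra_class(index, anchor_idx, anchor_class, k=3):
        """Randomly pick up to k other images from the anchor's entity class."""
        pool = [i for i in index[anchor_class] if i != anchor_idx]
        return random.sample(pool, min(k, len(pool)))

    labels = ["liver", "lung", "liver", "liver", "lung"]
    idx = build_class_index(labels)
    print(sample_intra_class(idx, anchor_idx=0, anchor_class="liver", k=2))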

Evaluation Protocol and Metrics (R1): We used standard dataset splits to prevent overlap in train/test samples. Due to differences in medical datasets, cross-dataset evaluation is not applicable, so we followed the standard approach in the literature. We reported exact match, BLEU, and ROUGE scores for lexical comparisons and used the BERT Score for semantic similarity.
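For reference, the reported lexical and semantic metrics are typically computed with common libraries such as nltk, rouge-score, and bert-score; the snippet below is an illustrative sketch with whitespace tokenization, not the authors' exact evaluation script.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    pred = "there is a lesion in the left lung"
    ref = "a lesion is present in the left lung"

    exact_match = float(pred.strip().lower() == ref.strip().lower())
    bleu = sentence_bleu([ref.split()], pred.split(),
                         smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(ref, pred)["rougeL"].fmeasure
    _, _, f1 = bert_score([pred], [ref], lang="en")   # semantic similarity

    print(exact_match, bleu, rouge_l, f1.item())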

Figure 2 Explanation (R2): We train both GLIMS and the MLP layers.

Missing Ablation Studies (R1/R2): Due to space limitations, we only focused on key comparisons. Our architecture builds on [31] and incorporates GLIMS and BaMCo loss for improved performance. Although individual module ablations were not detailed, we observed improvements in our internal experiments, and their contributions are evident in Table 3. Furthermore, Table 3 and Figure 4 show that BaMCo achieves stronger performance than BiomedCLIP + alignment, which only uses the CLIP loss. Although we did not isolate CE loss specifically, its effectiveness is evident and grounded in prior work [31].

General Clarity (R1/R2/R3): Suggested clarifications will be reflected in the manuscript.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The work seems to have clear method contribution and is supported by extensive experiments. I recommend the authors to objectively view the reviewer’s response and address/clarify major concerns.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While all three reviewers pointed out writing clarity as a key area for improvement, the technical contributions were ultimately acknowledged as novel, effective, and impactful. The rebuttal was comprehensive and addressed major concerns around design motivation, loss function utility, and evaluation protocols. All three now recommend acceptance. As a meta-reviewer, I echo their sentiment to recommend Accept.


