Abstract

Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2665_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/BioMedIA-MBZUAI/MOTOR

Link to the Dataset(s)

MIMIC-CXR-JPG: https://physionet.org/content/mimic-cxr-jpg/2.1.0/
Medical-Diff-VQA: https://physionet.org/content/medical-diff-vqa/1.0.0/
MIMIC-CXR-VQA: https://physionet.org/content/mimic-ext-mimic-cxr-vqa/1.0.0/
CXR-PRO: https://physionet.org/content/cxr-pro/1.0.0/

BibTex

@InProceedings{ShaMai_MOTOR_MICCAI2025,
        author = { Shaaban, Mai A. and Saleem, Tausifa Jan and Papineni, Vijay Ram Kumar and Yaqub, Mohammad},
        title = { { MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {467--477}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce MOTOR, a new multimodal retrieval and re-ranking framework that incorporates grounded captions and optimal transport. The method aims to model the relationships between the query and retrieved context using both textual and visual cues. This approach appears effective in surfacing more clinically relevant contextual information to enhance the input to vision-language models (VLMs).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A key strength of this work lies in its integration of grounded captions and optimal transport for multimodal retrieval, which allows the model to capture fine-grained semantic alignments between queries and retrieved contexts across modalities.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Weaknesses:

    1. Unclear connection between reasoning challenges and proposed solution: While the abstract highlights limitations in the reasoning capabilities of VLMs, the paper does not clearly articulate how the proposed optimal transport (OT) approach directly addresses this issue. A more explicit explanation or theoretical linkage would strengthen the contribution.

    2. Limited discussion on low-resource settings: The Introduction mentions context limitations in low-resource clinical settings, but the paper does not elaborate on how the proposed method mitigates or is applicable to such scenarios. Clarifying this would enhance the practical relevance of the work.

    3. Ambiguity around the term “faithfulness”: The authors reference the “faithfulness” of their model, but this term is not clearly defined in the context of the paper. It is unclear whether it refers to alignment with ground truth, interpretability, or some other criterion.

    4. Inconsistent description of similarity measures: The manuscript claims that the OT formulation leverages multiple similarity measures, yet only cosine similarity appears to be used in practice. This inconsistency could confuse readers and warrants clarification or revision.

    5. Lack of clarity in Algorithm 1: The vector database 𝐷 in Algorithm 1 is not clearly defined. It would be helpful to specify what it contains (e.g., embeddings, captions, image-text pairs) and how it is constructed.

    6. Dataset usage is unclear: The experimental setup is somewhat confusing. The model appears to be trained on MIMIC-CXR-JPG, yet evaluation is conducted on Medical-Diff-VQA and MIMIC-CXR-VQA. The rationale for this dataset split, and whether domain adaptation or fine-tuning was applied, should be made explicit.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall Assessment: While the paper presents an interesting idea by integrating grounded captions and optimal transport for multimodal retrieval, the weaknesses outlined above—particularly the lack of clarity in methodological contributions and inconsistencies in the experimental setup—significantly outweigh the strengths. As it stands, the paper would benefit from substantial clarification and additional evidence to support its claims before it can be considered for publication.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    It remains unclear how the “vector database” was constructed. Was any model used to compute the vector representations? This is a critical component of the retrieval pipeline, yet it has not been adequately addressed.

    As such, my initial rating stands.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a new multi-modal retrieval and reranking method to effectively address the lack of relevant clinical contexts in VLM output. Experiments prove its effectiveness.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This is the first paper that incorporates multi-modal reranking within medical MM-RAG frameworks.
    • The paper is well written and reports good results.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It lacks an efficiency analysis.
    • It lacks some experiments, for instance on the impact of $s$ and $k$. In addition, Fig. 4 needs more discussion to illustrate the advantages of this method.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is novel because it is the first attempt to demonstrate the high effectiveness of reranking for providing relevant context to improve medical VQA. However, my concern is the lack of sufficient ablations on key parameters such as $s$ and $k$.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors effectively address my concerns. I vote for acceptance.



Review #3

  • Please describe the contribution of the paper

    The paper introduces MOTOR, a framework for medical visual question answering that leverages optimal transport to re-rank retrieved multimodal samples from the MIMIC-CXR dataset, based on a query comprising an image, captions, and a question. By computing a multimodal similarity matrix and optimizing it with optimal transport, MOTOR selects the most clinically relevant contexts to give to a vision-language model within a RAG system. They show improved results compared to text-based RAG models and direct answer prediction without RAG.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Applying optimal transport to re-rank retrieved elements in a multimodal MedVQA setting seems to be novel in the domain. It likely enhances the medical relevance of retrieved samples, as the image is crucial to make this decision.
    • The paper demonstrates a significant performance gain compared to prior methods, both text-based RAG systems and VLMs without knowledge retrieval. They test their method on two models and two datasets.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The knowledge base as well as the datasets the method is evaluated on stem from MIMIC-CXR. Which splits are used for evaluation and as the knowledge base? Is there any conflict?
    • The novelty is somewhat limited. The main contribution is to use OT in a multimodal VQA setting, while it has already been applied to retrieval ranking in text-only settings. Also, there have been works on OT for cross-modal retrieval, such as https://arxiv.org/html/2403.13480v1.
    • The radiologist evaluation was only done for the proposed method and therefore does not give a lot of information; it would be good to also do it for the baselines (at least the best baseline) to show the benefit of MOTOR.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed concept has not been used for the problem at hand, and I believe the consideration of multimodal data for knowledge retrieval in MedVQA is crucial. Further, the results demonstrate relevant improvements. Nevertheless, I would like to see clarifications regarding the use of MIMIC-based datasets for both knowledge and testing, and an additional radiologist evaluation for the baseline, which I hope the authors will address in the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors clarified most of my questions, and I believe the contributions in the new multi-modal OT framework are valuable. Therefore I recommend acceptance.




Author Feedback

We thank the reviewers for their valuable feedback. We are encouraged by their positive reception of our work and appreciate their recognition of its key strengths, including 1) the novel integration of grounded captions and OT for fine-grained multimodal retrieval (R1, R3), 2) being the first to incorporate multimodal reranking in a medical RAG framework (R2), and 3) strong results and clear writing (R2, R3). We have addressed the comments and will reflect the improvements and further clarification in the revised paper.

R1,R3-Knowledge base and data split: As shown in our pipeline (Fig2), our method requires no training or fine-tuning. Instead, it dynamically retrieves clinically relevant reports from the original MIMIC-CXR-JPG’s train split, which serves as the knowledge base. For evaluation, we used the original test splits of Medical-Diff-VQA and MIMIC-CXR-VQA and ensured no overlap with retrieval data. The vector database consists of pre-computed embeddings of images and medical reports (Datasets,pg5).
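
[Editor's note] As a minimal sketch of how such a pre-computed vector database could be constructed and queried, assuming a frozen encoder pair that maps images and reports to unit-norm embeddings. The `encode_image`/`encode_text` stand-ins, the toy knowledge base, and the image-to-image retrieval rule are illustrative placeholders, not the paper's actual components:

```python
# Illustrative sketch only: the encoders below are random stand-ins for a
# frozen multimodal encoder; the tiny knowledge base mimics (image, report)
# pairs from the MIMIC-CXR-JPG train split, embedded once, offline.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # hypothetical embedding dimensionality

def encode_image(image_path: str) -> np.ndarray:
    """Stand-in for a frozen vision encoder; returns a unit-norm embedding."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text(report: str) -> np.ndarray:
    """Stand-in for a frozen text encoder; returns a unit-norm embedding."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Knowledge base: (image, report) pairs whose embeddings form the vector DB.
knowledge_base = [
    ("img_001.jpg", "No acute cardiopulmonary process."),
    ("img_002.jpg", "Mild cardiomegaly without pleural effusion."),
]
image_index = np.stack([encode_image(img) for img, _ in knowledge_base])
report_index = np.stack([encode_text(rep) for _, rep in knowledge_base])

def retrieve(query_image: str, top_k: int = 1) -> list:
    """Return reports whose paired images are most similar to the query."""
    q = encode_image(query_image)
    scores = image_index @ q               # cosine similarity (unit-norm rows)
    order = np.argsort(-scores)[:top_k]
    return [knowledge_base[i][1] for i in order]

print(retrieve("query_cxr.jpg", top_k=2))
```

Because the stand-in encoders return random vectors, the retrieved reports here are meaningless; the sketch only shows the shape of the pipeline (offline embedding, index, top-k cosine lookup).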

R1-Reasoning: The reasoning gap of standard RAG lies in the limited scope of evidence and ungrounded context [11,12]. Our OT method addresses this by optimizing context selection (Intro,pg3). Theoretically, minimizing OT cost (Eq1-4) ensures answers are grounded in clinically aligned references, providing traceable evidence and suppressing noise that could mislead VLMs. This will be clarified.

R1-Low-resource settings: MOTOR operates with frozen VLMs (Fig2), thus requiring no fine-tuning or large-scale training data. It performs dynamic retrieval and selects the most relevant clinical evidence at inference time from a fixed number (top-k) of retrievals (Sec3,pg5&6), making it particularly suitable for low-resource clinical settings without training capabilities.

R1-Faithfulness: While other works in the literature [21,31] use “faithfulness” to describe factual alignment between answers and context, we did not use this term to refer to our model but rather to the factual alignment between generated and ground-truth answers (Tab1) and clinical evidence matching (Fig4). This will be clarified.

R1-Similarity measures: The term “measures” is misleading. We use cosine similarity to measure multiple relationships (Eq1): 1) question-report alignment, 2) textual similarity, and 3) visual similarity—each operating on different modalities and features. We will revise the wording.
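
[Editor's note] A toy illustration of how a single similarity function (cosine) can measure the three relationships listed above. The grouping of features into three query/candidate "facets" and the uniform averaging are simplifications for exposition, not the paper's exact Eq. 1:

```python
# Sketch: build a multimodal cost matrix from three cosine-similarity terms.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

n, m, d = 4, 8, 512  # query facets, retrieved candidates, embedding dim
q_txt, q_cap, q_img = (rng.standard_normal((n, d)) for _ in range(3))
r_txt, r_cap, r_img = (rng.standard_normal((m, d)) for _ in range(3))

sim = (cosine(q_txt, r_txt)    # 1) question-report alignment
       + cosine(q_cap, r_cap)  # 2) textual (grounded-caption) similarity
       + cosine(q_img, r_img)) # 3) visual similarity
cost = 1.0 - sim / 3.0         # averaged similarity turned into an OT cost
```

The resulting `cost` matrix is what an OT solver would consume; see the Sinkhorn sketch after the next point.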

R2-Efficiency: MOTOR builds on existing retrieval pipelines without modifying their architecture and requires no training, thus maintaining comparable efficiency with no extra memory. The extra cost at inference time is limited to cosine similarity computations and Sinkhorn-based OT [10].
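
[Editor's note] For reference, the Sinkhorn iteration amounts to a few matrix-vector products per batch of retrieved candidates, which is why the overhead claim is plausible. A minimal sketch of entropy-regularized OT between uniform marginals (Cuturi, 2013) follows; the dimensions, `eps`, and the transport-weighted scoring rule are illustrative assumptions, not the paper's tuned choices:

```python
# Sketch: Sinkhorn-based OT and one common way to score candidates from the plan.
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.1, n_iters: int = 100) -> np.ndarray:
    """Entropy-regularized OT plan between uniform marginals."""
    n, m = cost.shape
    mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)                # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = mu / (K @ v)                   # scale rows to match mu
        v = nu / (K.T @ u)                 # scale columns to match nu
    return u[:, None] * K * v[None, :]     # transport plan P

rng = np.random.default_rng(0)
cost = rng.random((4, 8))                  # e.g., query facets x candidates
P = sinkhorn(cost)
scores = (P * (1.0 - cost)).sum(axis=0)    # transport-weighted similarity
print(np.argsort(-scores))                 # candidates re-ranked by OT score
```

Note that with balanced marginals the raw column mass of P is fixed by nu, so re-ranking signals are typically read off via transport-weighted similarity (as above) or a related aggregation of the plan.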

R2-Ablations and Fig4: We conducted ablations on different values of k and s (Sec3,pg6) and found that retrieving too many candidates (large k or s) introduces noise, while too few misses relevant information. Across variations, performance changes were minimal (±0.2% accuracy) with a consistent trend observed (MOTOR>baseline rerankers>no reranking). We will expand the discussion of Fig4.

R3-Prior work: In contrast to prior works and the mentioned paper, MOTOR introduces three key novelties: 1) a training-free OT framework, 2) integration of grounded captions, and 3) clinical validation of this combined approach. To our knowledge, this synergy between OT and grounded retrieval is the first to address clinical relevance in MedVQA.

R3-Evaluation: The radiologist mainly evaluated whether grounded captions could reliably serve as evidence (not limited to MOTOR), which is not possible for the best baseline (text-only). Further evaluations will be explored in future work.

We appreciate the constructive comments, which have strengthened our work, and believe the revisions will further improve its clarity and impact.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The proposed method is quite intuitive, as it investigates optimal transport for improving the reranking of the retrieval module and thereby improves the RAG-based generation via VLMs. Most of the reviewers’ concerns are addressed in the rebuttal, but the implementation of the vector database is not well explained there. The authors are encouraged to devote more space to implementation details, introducing how the visual and textual embeddings are extracted to form the pre-computed vector database.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a novel multimodal retrieval and re-ranking framework (MOTOR) for medical VQA, integrating grounded captions with optimal transport. The authors’ rebuttal effectively addressed all major points, and the overall contribution is both original and practically relevant. I recommend acceptance.
