Abstract

Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although medical Vision-Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model size (2B parameters), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.
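The abstract's key idea, rewarding reasoning behavior without any reasoning references, can be illustrated with a minimal sketch of a GRPO-style rule-based reward: a format reward checks that the completion contains reasoning and answer spans, an accuracy reward checks the final answer, and advantages are computed relative to the group of sampled completions. The `<think>`/`<answer>` tag names, the equal reward weights, and the example completions are illustrative assumptions, not the paper's exact implementation; the policy-gradient update itself is omitted.

```python
import re
from statistics import mean, pstdev

def reward(completion: str, correct_answer: str) -> float:
    """Rule-based reward: format term + accuracy term, no reasoning labels needed."""
    # Format reward: reasoning span followed by an answer span (tags are assumed).
    fmt = 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                           completion, re.DOTALL) else 0.0
    # Accuracy reward: compare only the final answer against the ground truth.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    acc = 1.0 if m and m.group(1).strip() == correct_answer else 0.0
    return fmt + acc  # total reward in [0, 2]

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Hypothetical group of completions sampled for one VQA question (answer "B").
completions = [
    "<think>Lesion visible in the left lobe.</think><answer>B</answer>",
    "<answer>B</answer>",                        # correct, but no reasoning span
    "<think>Unsure, guessing.</think><answer>C</answer>",  # well-formed, wrong
]
rs = [reward(c, "B") for c in completions]
advs = group_relative_advantages(rs)
```

Only the first completion earns both reward terms and thus a positive advantage, which is how the scheme incentivizes explicit reasoning even though no reference reasoning text is ever scored.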

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3267_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/JZPeterPan/MedVLM-R1

Link to the Dataset(s)

N/A

BibTex

@InProceedings{PanJia_MedVLMR1_MICCAI2025,
        author = { Pan, Jiazhen and Liu, Che and Wu, Junde and Liu, Fenglin and Zhu, Jiayuan and Li, Hongwei Bran and Chen, Chen and Ouyang, Cheng and Rueckert, Daniel},
        title = { { MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {339--349}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper demonstrates that RL-based fine-tuning (via GRPO) of a VLM for the medical VQA task outperforms standard supervised fine-tuning (SFT), especially on out-of-domain data. It also presents interesting explicit examples of the reasoning capabilities of the fine-tuned model.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Strong empirical results: model fine-tuned via GRPO significantly outperforms SFT model and other baselines on medical VQA dataset, especially in domain shift scenarios.
    • Good paper organization.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The paper does not present a full-fledged medical VLM (as it may seem from the title), the model is evaluated only on a particular medical VQA dataset.
    • Methodological novelty is limited: authors apply the recently established GRPO method in its original form to a particular task (medical VQA).
    • Additional analysis and discussion of the better generalization of RL-based model is needed.
    • Authors do not promise to share the code for their experiments. Sharing the code would be useful for researchers and practitioners interested in RL-based training of medical VLMs.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I would suggest that the authors rename the paper and the model to make it clear that their model is not a full-fledged, all-purpose VLM, but rather a VQA model with reasoning capabilities, evaluated on a particular VQA dataset.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a proof-of-concept experiment showing that RL-based fine-tuning of general-domain VLMs for medical domain tasks can be more efficient than standard SFT. I believe that, along with the code (if the authors share it), it is a valuable contribution to the MICCAI community. However, to avoid overstating its value, the paper and the model should be retitled.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors developed MedVLM-R1, a medical VLM for radiological tasks that uses GRPO-based RL to produce explicit reasoning alongside the final answer, rather than the final answer alone.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors address a very important and largely unexplored field: the explainability of AI models. This point is now required by the new AI law adopted by the European Union, so research in this field is very important.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • No test/retest analysis was performed to evaluate the variability and robustness of the answers.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • It would be very interesting to analyse the variability of the model's answers and the potential variability in the model's reasoning.
    • It is not clear what performance metric is reported in Table 1.
    • It would be very interesting to combine this approach with attention maps. When the model's reasoning says the answer is urolithiasis because urinary calculi are observed in the image, it would be interesting to know where in the image the model found the calculi.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though work still needs to be done, explainability studies are very important to enable the implementation of AI models in clinical routine.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes to use reinforcement learning in a generative vision-language model. The proposed model provides better reasoning explainability by outputting the model's reasoning process.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The investigated area is of strong clinical interest, and the proposed method addresses a critical problem in generative modelling.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    As discussed in the paper, the proposed model is limited in out-of-distribution inference and somewhat limited in that it only handles the multiple-choice question (MCQ) format.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The investigated topic is of great clinical and industrial interest.

  • Reviewer confidence

    Not confident (1)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely appreciate the insightful and positive feedback from all reviewers and the meta-reviewer. We are encouraged by the consensus on the importance and relevance of our work, particularly its potential clinical impact, explicit reasoning capabilities, and strong empirical results. Additionally, we are grateful for the constructive suggestions, which will significantly enhance our paper.

Concerning reproducibility (R2, R3), we fully recognize its importance and commit to releasing our code prior to the MICCAI 2025 conference. This will facilitate reproducibility and encourage broader adoption by researchers and practitioners.

Regarding concerns about potentially overstating our contributions (R2, Meta-reviewer), we acknowledge this valuable feedback. While our current model primarily targets VQA tasks, our ongoing research aims to develop a comprehensive, general-purpose medical VLM. To address these concerns clearly, we will explicitly emphasize in critical sections such as the abstract and introduction that our current scope specifically focuses on VQA tasks. Additionally, we will outline our future directions toward broader medical VLM applications, including grounding and report generation.

Regarding the variability of model responses and the integration of attention maps (R3), these suggestions indeed strengthen the clinical applicability and explainability of our model. We will thoroughly discuss and incorporate these considerations into our revised manuscript.

We sincerely thank the reviewers and meta-reviewer once again for their insightful comments, which have significantly improved the quality and clarity of our paper. We look forward to presenting our research and sharing our open resources with the MICCAI community in Korea!




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The paper introduces MedVLM-R1, a vision-language model enhanced through reinforcement learning to improve medical reasoning capabilities, with a focus on radiological applications. The three reviewers unanimously support acceptance, though with varying degrees of enthusiasm—two assign a “weak accept” rating while one recommends straightforward “accept.”

    Reviewers highlight several strengths, most notably the model’s ability to generate explicit reasoning chains alongside its answers. Yet, the paper’s novelty was questioned, as it applies GRPO in a fairly standard way rather than introducing methodological innovation. I personally agree that the title overpromises by implying a general-purpose medical VLM when the model is actually specialized for VQA tasks—a clarification that would set more accurate expectations.

    Trusting the authors will address the concerns of the reviewers and amend the title, I recommend a provisional accept for this work.


