Abstract

Medical Large Vision-Language Models (Med-LVLMs) have shown promise in enhancing medical diagnosis by enabling interactive and knowledge-driven healthcare applications. However, these models often suffer from factual hallucinations, which can lead to incorrect diagnoses. Retrieval-augmented generation (RAG) has been proposed to mitigate these issues, yet its effectiveness in multi-modal medical applications is hindered by over-reliance on retrieved data and the opacity of text-based reasoning. To address these challenges, we propose GoCa, a multi-modal RAG system based on chain-of-thought (CoT) distillation and explicit thought optimization, designed to enhance both the factuality and explainability of Med-LVLMs. GoCa consists of three key components: (1) a self-evolving CoT framework that leverages multi-agent collaboration to iteratively refine diagnostic reasoning; (2) a seamless, preference-guided optimization mechanism that distills high-quality CoT reasoning via preference tuning; and (3) an adaptive Monte Carlo-like top-k selection strategy. These innovations keep the RAG process logically transparent and adaptable, significantly improving consistency when integrating retrieved contexts. Experimental results across multiple medical visual question answering (Med-VQA) datasets demonstrate that GoCa outperforms several recent state-of-the-art methods, achieving superior factual accuracy and coherence. The code will be made available.
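
For readers who want the pipeline in concrete terms, below is a minimal structural sketch of the self-evolving CoT loop as we read it from the abstract and the reviews. All agent functions, their names, the stopping rule, and the iteration cap are stand-ins for the prompted LLM agents described in the paper, not the authors' implementation.

```python
# Illustrative sketch of the self-evolving CoT refinement loop.
# All agent calls are stubs standing in for LLM prompts (the rebuttal
# mentions GPT-4o-mini agents); names and control flow are assumptions.

MAX_ROUNDS = 3  # hypothetical iteration cap


def student_draft(image, question):
    """Student agent drafts an initial chain-of-thought."""
    return "draft CoT for: " + question


def inspector_retrieve(cot):
    """Inspector retrieves supporting evidence (e.g., PubMed, wiki)."""
    return ["retrieved passage relevant to the CoT"]


def teacher_review(cot, evidence):
    """Teacher-doctor agent scores the CoT; returns (ok, feedback)."""
    return True, "reasoning is consistent with the evidence"


def writer_revise(cot, feedback, evidence):
    """Writer rewrites the CoT according to the feedback."""
    return cot + " [revised per feedback]"


def refine_cot(image, question):
    cot = student_draft(image, question)
    for _ in range(MAX_ROUNDS):
        evidence = inspector_retrieve(cot)
        ok, feedback = teacher_review(cot, evidence)
        if ok:  # supervisor-style stopping condition
            break
        cot = writer_revise(cot, feedback, evidence)
    return cot


print(refine_cot(None, "Is there cardiomegaly on this chest X-ray?"))
```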

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0421_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Da1daidaidai/Goca

Link to the Dataset(s)

https://physionet.org/content/mimic-cxr-jpg/2.0.0/

https://ophai.hms.harvard.edu/code/fairclip/

https://www.kaggle.com/datasets/raddar/chest-xrays-indiana-university

BibTex

@InProceedings{DaiPen_GoCa_MICCAI2025,
        author = { Dai, Pengyu and Ou, Yafei and Yang, Yuqiao and Jin, Ze and Suzuki, Kenji},
        title = { { GoCa: Trustworthy Multi-Modal RAG with Explicit Thinking Distillation for Reliable Decision-Making in Med-LVLMs } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {258 -- 268}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a multi-modal RAG framework combating factual hallucinations in Med-LVLMs via CoT reasoning. It integrates self-evolving multi-agent collaboration, preference-guided CoT distillation, and adaptive Monte Carlo-like retrieval to enhance diagnostic transparency and accuracy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The authors integrate explicit reasoning distillation with retrieval-augmented preference tuning (PT), unifying factual retrieval and interpretable CoT in medical multi-modal models, thereby addressing key limitations of conventional RAG-PT methods. By shifting the Med-LVLM's reasoning mechanism from outcome-driven to process-oriented, this work enhances the credibility and transparency of medical decision-making.

    2) Comprehensive experiments across diverse datasets and ablation studies validate the framework's effectiveness and generalizability. The method achieves SOTA performance on three benchmark datasets in terms of both accuracy and F1 metrics. The significant 13.18% accuracy improvement over baseline methods substantiates the efficacy of the proposed thought distillation strategy.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The paradigm of CoT + RAG has been applied in prior work. Apart from RAG distillation, the authors should clarify what is unique about this paper in that regard. It is suggested to include methods that also adopt the RAG + CoT strategy in the comparative experiments.

    2) The adaptive Monte Carlo-like top-k selection strategy proposed in the paper is effective, but the term "Monte Carlo simulation" needs further clarification, as the current rule does not appear to rely on random sampling. Instead, it depends on deterministic rules (directly calculating the risk and selecting the minimum) to choose k, rather than approximating a solution through random sampling or probabilistic statistics. It is recommended to either revise the terminology or explain the source of randomness.

    3) Is the definition of FR(k) reasonable? Accuracy measures the proportion of correct answers, but in medical diagnosis different errors carry different degrees of severity. Relying solely on accuracy may overlook critical factors, such as the differing consequences of false negatives and false positives. For example, even at the same accuracy, the risk of missing a severe disease may be higher, making such an evaluation metric potentially insufficient.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper shows some innovation and achieves good performance, but the differences from existing CoT+RAG methods need to be clarified.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors answered my question.



Review #2

  • Please describe the contribution of the paper

    This paper addresses a critical issue in large medical vision-language models (Med-LVLMs): factual hallucination, which can pose serious risks in clinical settings. Existing retrieval-augmented generation (RAG) methods attempt to mitigate this but suffer from over-reliance on retrieved context and lack transparent reasoning processes. To resolve these, the authors propose GoCa, a multi-modal RAG framework that combines Chain-of-Thought (CoT) Distillation and Explicit Thought-Based Preference Optimization. The approach enhances both factuality and explainability in medical reasoning tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper tackles the real-world issue of hallucination in Med-LVLMs and challenges in current RAG systems such as blind reliance on retrieved content and lack of reasoning transparency.
    • The proposed approach of student-teacher and supervisor-inspector-writer model mimics clinical decision-making, iteratively refining reasoning quality.
    • The explicit preference-based optimization identifies and corrects reasoning drift caused by RAG interference.
    • Instead of relying on manually labeled instruction sets, the proposed method self-generates CoT refinements, making it more scalable.
    • The Monte Carlo-based selection of Top-K retrieved documents per query improves factual grounding while avoiding noise.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It is unclear on what basis the teacher doctor evaluates whether reasoning meets diagnostic quality.
    • How the supervisor agent determines standard compliance needs elaboration (are there explicit clinical rules, templates, or metrics?).
    • Ablation study is missing. It is recommended to show the impact of extra rounds of explicit preference tuning to understand its marginal contribution to factual accuracy and reasoning quality.
    • Details are lacking about what kind of knowledge the inspector retrieves and from where (medical literature, databases, EHRs?).
    • It is not clear whether all agents are trained separately or jointly fine-tuned — this affects reproducibility and model consistency.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • What is the stopping criterion at inference time? How does the system determine that the Chain-of-Thought (CoT) is complete and no further refinement by the agents (e.g., teacher doctor) is needed?
    • How does the model assess CoT quality during inference? Is there a factuality score, confidence threshold, or some other internal metric used to decide that the CoT meets diagnostic standards?
    • Is there any observed failure case where the refinement loop degraded the reasoning quality? Or cases where the CoT could not converge to a correct diagnosis?
    • What is the source of external knowledge that the inspector agent retrieves?
    • Are these agents trained jointly in an end-to-end pipeline or separately? If separately, how is consistency ensured between their outputs?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper tackles an important problem in the domain of medical vision-language models, specifically addressing hallucination and the limitations of retrieval-augmented generation. The use of explicit preference-based optimization for refining reasoning quality and the adaptive Monte Carlo-based retrieval strategy are also commendable. Notably, the self-generating CoT refinement without manual annotations increases the scalability of the approach. Despite these strengths, I recommend a weak reject due to several concerns about method and experiment details that require clarification. In addition, further empirical analysis should be provided to clarify the contribution of each component.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I have read the other reviewers' comments and the authors' rebuttal. Thanks to the authors for clarifying important aspects during the rebuttal. This paper addresses a critical challenge in multimodal medical AI. The authors provide empirical evidence for the effectiveness of their method, including quantitative metrics, ablation studies, and comparative experiments. Overall, the paper could contribute to the MICCAI community, in particular for factual reasoning in medical vision-language tasks. So, I would suggest acceptance of this paper.



Review #3

  • Please describe the contribution of the paper
    1. The authors propose a multimodal Retrieval-Augmented Generation (RAG) system that leverages Chain-of-Thought (CoT) distillation and explicit thought optimization to enhance the factuality and explainability of medical vision-language models (Med-LVLMs).

    2. The adoption of an adaptive Monte Carlo–like top-k selection strategy makes the proposed method more flexible.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method was validated on multiple medical visual question answering (Med-VQA) datasets and outperformed several state-of-the-art approaches.

    2. An extensive ablation study was conducted to demonstrate the contribution of each component of the proposed method.

    3. The paper provides detailed descriptions of the proposed method.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The proposed method is only evaluated on the medical visual question answering (Med-VQA) task.

    2. The caption of Figure 1 could be improved to provide a clearer overview of the proposed method, making it easier for readers to follow the subsequent sections.

    3. It would strengthen the paper if the proposed method were validated on additional tasks beyond Med-VQA.

    4. The evaluation relies solely on quantitative metrics. Including qualitative examples or case studies demonstrating how the method produces better reasoning and generates more accurate answers would enhance the paper.

    5. The code has not been released at this time, which raises concerns about the reproducibility of the work—even though the authors state that it will be made available.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1.The evaluation relies solely on quantitative metrics. Including qualitative examples or case studies demonstrating how the method produces better reasoning and generates more accurate answers would enhance the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Based on the comments above; in addition, I believe the research topic is meaningful.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.
    1. In my opinion, it is nearly impossible to reproduce this work without public access to the code. While the authors mention that they plan to release the code in the future, it is not currently available.

    2. Regarding qualitative analysis, even with page limitations, it should be possible to include one or two examples in the MICCAI version. The authors note that they will include this in an extended journal version, but the lack of qualitative results weakens the current submission.

    3. Overall, this work makes a meaningful contribution to the field of medical diagnosis.




Author Feedback

We thank all the reviewers for their valuable suggestions. We are grateful that the reviewers appreciate the novelty and effectiveness (R#1,3,4), good results (R#1,3), and solid experiments (R#1,3) of our work. We address each reviewer’s questions as follows:

#R1 [Beyond VQA Task] Our study focuses on hallucinations and logical opacity in the Med-VQA task, which we believe is a representative scenario in multimodal medical AI. The binary nature of VQA answers (e.g., yes/no) provides a clearer signal for factual correctness, which is central to our proposed method. We agree that RAG and multi-agent systems have broader task scenarios, such as planning, and will try to include more tasks in the journal version. [Qualitative Example] We have collected and analyzed many cases qualitatively. Due to page limitations, we will include them in the longer final version. #R1, R4 [Reproducibility] We will make our code and all prompt templates available, as mentioned in the abstract. All LLM temperatures were set within a moderate range (0.2~0.6), so we believe the method is easy to reproduce.

#R3 [CoT+RAG method] We took the advice and ran RAT (arXiv:2403.05313) on our tasks, obtaining ACC/F1 on the 3 datasets of (75.3/79.4, 62.7/73.5, 775.3/79.4), which is lower than ours (81.1/71.5, 88.5/93.8, 84.8/88.0). In early experiments, we found that CoT+RAG did not mitigate but rather exacerbated the over-reliance on retrieved content, which prompted us to explore preference tuning using exogenous CoT, with good results. We will include a discussion of this in the final version. [Monte Carlo top-k] Thank you for pointing this out. The term “Monte Carlo-like” is intended to reflect the iterative exploration over the candidate top-k values, inspired by the idea of non-fixed, dynamic selection rather than a strict heuristic. We will clarify this in the revised version. [Definition of FR(k)] Indeed. In the closed-set Med-VQA task, the first-priority metric is commonly factual consistency, e.g., accuracy (arXiv:2306.00890, arXiv:2410.13085), which is the convention this work follows and the reason we select ACC here. In our early experiments, we found no significant difference between using F1 and ACC here, though we agree that the definition should differ for different tasks.
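
As one concrete reading of this clarification, the sketch below evaluates each candidate k on a validation split with FR(k) = 1 - accuracy (accuracy being the metric the rebuttal names) and picks the k of minimum estimated risk. The bootstrap resampling is a hypothetical way to supply the random, "Monte Carlo-like" exploration, and the helper `answer_with_top_k` is an assumed callable, not the authors' code.

```python
import random


def fr(k, val_items, answer_with_top_k):
    """Risk FR(k) = 1 - accuracy when the model answers with the
    top-k retrieved passages (accuracy per the rebuttal)."""
    correct = sum(answer_with_top_k(item, k) == item["label"]
                  for item in val_items)
    return 1.0 - correct / len(val_items)


def select_k(candidate_ks, val_items, answer_with_top_k, n_boot=20):
    """Choose the k minimizing the mean bootstrap estimate of FR(k).

    Resampling val_items with replacement is an illustrative source of
    randomness; a single deterministic pass over the full set recovers
    the arg-min rule the reviewer describes.
    """
    best_k, best_risk = None, float("inf")
    for k in candidate_ks:
        risks = [fr(k, random.choices(val_items, k=len(val_items)),
                    answer_with_top_k)
                 for _ in range(n_boot)]
        mean_risk = sum(risks) / len(risks)
        if mean_risk < best_risk:
            best_k, best_risk = k, mean_risk
    return best_k
```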

#R4 [Teacher Doctor/Supervisor’s CoT quality] Following prior work on multi-agent systems (arXiv:2410.02603) and medical agents (arXiv:2311.10537), we define the roles and tasks of each agent explicitly, provide few-shot prompt templates, and incorporate soft metrics such as logical consistency and medical factuality. Manual inspections confirm the effectiveness of this design, and it is a focus of our future work. We will include qualitative cases and a detailed evaluation in the revised version.
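
As an illustration of what such a role definition with soft metrics might look like (the actual templates are not included in this page; the wording, scoring criteria, and placeholders below are all hypothetical), a prompt for the teacher-doctor agent could be structured as:

```python
# Hypothetical template for the teacher-doctor agent. The soft metrics
# (logical consistency, medical factuality) follow the rebuttal, but
# the wording and the PASS/FAIL protocol are assumptions.
TEACHER_PROMPT = """You are a senior attending physician reviewing a
trainee's diagnostic chain-of-thought for a medical imaging question.

Assess the reasoning on:
1. Logical consistency: do the steps follow from the image findings?
2. Medical factuality: is each claim supported by the retrieved evidence?

Return PASS if both criteria are adequately met; otherwise return FAIL
with one sentence of feedback.

Question: {question}
Chain-of-thought: {cot}
Retrieved evidence: {evidence}
"""
```
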
[Training&Consistency] In the multi-agent CoT construction stage, all agents operate in a zero-shot manner through prompt engineering, which is standard multi-agent design practice (arXiv:2405.07960). CoT-based DPO training is applied only to LLaVA, NOT to the agents, so there is no inconsistency between agents. Therefore, this does not affect reproducibility or model consistency.
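
For reference, the rebuttal's DPO stage corresponds to the standard DPO objective (Rafailov et al., 2023), sketched below. Treating the refined CoT as the chosen response and the drift-affected CoT as the rejected one is our reading of the setup, and beta = 0.1 is an assumed hyperparameter.

```python
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: increase the policy's preference for the
    chosen response (e.g., a refined CoT) over the rejected one
    (e.g., a drifted CoT), measured relative to a frozen reference
    model. Inputs are per-example sequence log-probabilities."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```
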
[CoT in Inference] During inference, the agents are not involved, so neither a stopping criterion nor a quality assessment is required. The CoTs are generated by the agents (GPT-4o-mini) solely for supervising LLaVA via DPO training. [Ablation] We have already included the ablation you mentioned in Table 2 of the paper: comparing “PT” with “C1+C2” shows a 2.6% ACC gain from the extra round of CoT preference tuning. [Knowledgebase of Inspector] The knowledge base consists of PubMed, medicine-related Wiki entries, and online resources such as StatPearls. We will include this in the revised version. [Failure Case] We observed a few low-quality CoTs when there is a severe mismatch between the retrieved text and the image, and this is a point we plan to address in the journal version.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper is generally well-written, and the proposed GoCa leverages both RAG and distilled reasoning to mitigate hallucination in QA for medical domains. All the reviewers recommend “Accept”, which indicates the technical soundness of the proposed method. However, the authors are still encouraged to provide more explanation of the methodology and of the quality of the acquired CoT, for reproducibility.


