Abstract

Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries and to generate detailed, descriptive diagnostic medical reports. However, hallucination, the tendency to generate descriptions that are inconsistent with the visual content, remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and to study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect text. In the second stage, medical experts systematically review these reports, identifying and correcting potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLMs is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. Dataset and code are available at: https://github.com/bhattarailab/Hallucination-Aware-VLM
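The sentence-level hallucination tags described above can be pictured with a minimal sketch. Note that the record layout, field names (`text`, `hallucinated`, `correction`), and example sentences below are purely illustrative assumptions; they do not reflect the actual Gut-VLM schema or its contents.

```python
# Illustrative sketch of a sentence-level hallucination annotation record.
# All field names and values are hypothetical, not the real Gut-VLM schema.

def apply_corrections(sentences):
    """Replace sentences tagged as hallucinated with their expert corrections."""
    return [s["correction"] if s["hallucinated"] else s["text"] for s in sentences]

record = {
    "image": "kvasir-v2/polyps/0001.jpg",  # hypothetical path
    "sentences": [
        {"text": "A single sessile polyp is visible.",
         "hallucinated": False, "correction": None},
        {"text": "Three polyps are present in the cecum.",
         "hallucinated": True,
         "correction": "One polyp is present; the anatomical site is not identifiable."},
    ],
}

# A corrected report is recovered by keeping verified sentences and
# substituting expert corrections for hallucinated ones.
corrected_report = " ".join(apply_corrections(record["sentences"]))
print(corrected_report)
```

Under this layout, standard finetuning would train only on `corrected_report`, whereas hallucination-aware finetuning would also expose the model to the hallucinated sentences and their tags.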

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0774_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/bhattarailab/Hallucination-Aware-VLM

Link to the Dataset(s)

The proposed Gut-VLM dataset will be available at: https://github.com/bhattarailab/Hallucination-Aware-VLM

BibTex

@InProceedings{KhaBid_HallucinationAware_MICCAI2025,
        author = { Khanal, Bidur and Pokhrel, Sandesh and Bhandari, Sanjay and Rana, Ramesh and Shrestha, Nikesh and Gurung, Ram B. and Linte, Cristian and Watson, Angus and Shrestha, Yash R. and Bhattarai, Binod},
        title = { { Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {234--244}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • Curated Multi-modal Image-Text GI Dataset: The paper introduces Gut-VLM, a multi-modal GI image-text dataset, which the authors intend to release publicly.
    • Hallucination-aware Fine-tuning: The proposed method incorporates hallucinated sentences along with their corresponding corrections to introduce a novel hallucination-aware fine-tuning approach.
    • Benchmarking with State-of-the-art VLMs: The dataset is used to fine-tune various state-of-the-art vision-language models (VLMs), enabling a comparison between models trained with and without hallucination correction. The results demonstrate notable performance gains with the hallucination-aware setup.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel Multi-modal Image-Text GI Dataset: The paper presents a new multi-modal GI image-text dataset, featuring diagnostic reports generated by VLMs and subsequently verified by physicians. The authors intend to release this dataset publicly.
    • Hallucination-aware Fine-tuning: The proposed hallucination-aware fine-tuning strategy leads to notable performance improvements, as demonstrated in the results. The approach is interesting.
    • LLM-assisted Evaluation Metrics: Two LLM-assisted metrics, R-Sim and QAAS, are introduced to evaluate coarse-level semantic similarity, and to handle synonyms, and similar phrasing in VQAs.
    • Clarity and Presentation: The paper is well-written, well-structured, and easy to follow.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Dataset Composition: In Section 2.1, while various class names are mentioned, the exact number of classes should be clearly specified. Additionally, it would be helpful to include the distribution across classes to assess whether the dataset has sufficient samples per class for training. The results also do not clarify whether any specific classes were consistently missed in the generated reports.
    • Distinction from MedVQA-GI: The paper should clearly articulate how the proposed dataset differs from the one used in the MedVQA-GI challenge, to avoid confusion and to highlight its unique contributions.
    • Clarity on Public Dataset Components: It remains somewhat unclear which parts of the dataset will be made publicly available (for example, whether hallucinated reports, corrected reports, and VQA pairs will all be included).
    • Description about R-Sim and QAAS: The two proposed evaluation metrics R-Sim and QAAS need more description. An example for each can be included for clarity.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Useful Dataset and Promising Approach: The paper presents a new GI image-text dataset and a fine-tuning method for VLMs that shows strong results. Both contributions are timely and relevant.
    • Missing Details: The paper lacks important details about the dataset, such as class distribution and what parts will be shared publicly. The new evaluation metrics (R-Sim and QAAS) also need clearer explanation, possibly with examples.

    Overall, the work is promising but would benefit from more clarity and completeness in key areas.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present a human-verified multimodal image-text gastrointestinal dataset, Gut-VLM, for GI image analysis built from the images of Kvasir-v2. The dataset focuses on 12 diagnostic aspects such as anatomical class, polyp count, and medical findings. Based on this dataset, the authors further introduce hallucination-aware training, which trains the VLM first to identify hallucinations and then to correct them. The experiments show that the proposed method outperforms standard SFT on Gut-VLM by a large margin.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The curation of dataset is solid and impactful.
    2. The proposed hallucination-aware training seems to be effective.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Details of the hallucination-aware training are missing. Steps 1 and 2 in Fig. 2 are not described in the text. The reviewer would like to know how exactly the two-step training is performed.
    2. A comparison between the proposed method and other hallucination-reduction methods is missing. Also, the authors should conduct experiments on Kvasir-VQA, since it is mentioned in the manuscript.
    3. The reviewer would like to know why the performance of hallucination-aware training is lower than that of standard fine-tuning on the anatomical class.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is effective on the proposed Gut-VLM dataset, which is solid and might be impactful on the community. However, the experiments are not comprehensive.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    a) Creation of Gut-VLM Dataset: A novel GI image-text dataset derived from Kvasir-v2, featuring ChatGPT-generated reports corrected by gastroenterologists. Each sentence is tagged for hallucinations with corresponding expert-corrected versions.

    b) Benchmarking SOTA VLMs: Systematic evaluation of four Vision-Language Models (LLaVA, DeepSeek, Qwen, and mPLUG-Owl) on GI image understanding, using standard and novel LLM-assisted metrics (R-Sim, QAAS).

    c) Hallucination-Aware Finetuning: A new finetuning strategy where models are trained to detect and correct hallucinations rather than only regenerate reports, resulting in consistently higher performance across evaluation metrics.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    a) Provides fine-grained, expert-validated hallucination annotations, which are rare and valuable for training more trustworthy models.

    b) Hallucination-aware finetuning consistently outperforms traditional methods across models and tasks, including clinical expert scoring.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    a) Only responses from ChatGPT were corrected, potentially biasing the structure and content of ground-truth data toward ChatGPT’s language.

    b) Using ChatGPT both as a generator and evaluator (R-Sim and QAAS) could introduce unintentional bias.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper makes a significant and original contribution to medical vision-language modeling by addressing hallucinations in a clinically meaningful and methodologically novel way. Despite some limitations in scope and annotation granularity, the introduction of Gut-VLM and hallucination-aware finetuning strategy will likely benefit the broader community and spur further research. The benchmarking is rigorous, and the evaluation pipeline is thoughtfully designed. It’s a good candidate for acceptance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all the reviewers for their evaluation and constructive feedback. The reviewers have acknowledged the novelty of our dataset and its potential impact [R1, R2, R3], the strength of hallucination-aware finetuning [R1, R2, R3], the evaluation protocols [R1, R3], the strong performance results and comparisons [R1, R2, R3], and the clarity of the presentation [R1]. We address the reviewers’ concerns below:

Dataset composition, distinction, and release components [R1]: 1) Thank you for your suggestions. We will include the class counts in the final manuscript. Because of space constraints, we omitted several analyses, but we plan to expand this work into a journal version with further analysis. 2) Our dataset uses questions from MedVQA-GI to prompt VLMs to generate descriptive text responses, but MedVQA-GI contains only yes/no-type and short answers. As noted in the introduction, our dataset provides both descriptive VLM responses and hallucination tags from experts, which is important for addressing hallucination. We will make this clearer in the final version to avoid any confusion. 3) We will make the hallucination tags, corrected reports, and VQA pairs publicly available in the release.

Details about R-Sim and QAAS [R1]: Thank you for this suggestion. We will clarify the metrics further with an example in the final camera-ready version.

Hallucination-aware fine-tuning details [R2]: Thank you for pointing this out. Unfortunately, we had to limit this section due to the original page constraints. We will add more details to the camera-ready version, as it allows additional space.

Additional comparison and analysis [R2]: We acknowledge the lack of comparisons with existing hallucination reduction methods. We plan to further expand our benchmarks and methodological contribution in the journal version.

Results in anatomical class [R2]: Thank you for this insightful question. Since the anatomical classes are limited, well-defined, and consistent in the corrected version, there is less room for hallucination; hence, standard finetuning alone performed well in this category. In contrast, hallucination-aware finetuning might have introduced some errors by becoming more cautious and less confident, generating overly conservative responses. Although we do not have experiments to conclusively confirm this, it appears likely and could be a direction for future exploration.

ChatGPT limitation [R3]: We agree with the reviewer’s comment that using only ChatGPT-corrected responses might limit the scope and introduce some bias. While we ensured consistent evaluation across all benchmarks, incorporating an ensemble of various independent LLMs could enhance the robustness of the evaluation. We plan to consider this in future extended work.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


