Abstract

Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet, they remain prone to hallucinations—incorrect responses that contradict input images, posing substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current hallucination detection methods, especially semantic entropy (SE), have demonstrated promising hallucination detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over-rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input, to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with that derived from a distorted image. The entropy of the resulting distribution is estimated as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods. The code will be available at https://github.com/Merrical/VASE.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0083_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Merrical/VASE

Link to the Dataset(s)

MIMIC-Diff-VQA dataset: https://physionet.org/content/medical-diff-vqa/1.0.1/

VQA-RAD dataset: https://huggingface.co/datasets/flaviagiammarino/vqa-rad

BibTex

@InProceedings{LiaZeh_VisionAmplified_MICCAI2025,
        author = { Liao, Zehui and Hu, Shishuai and Zou, Ke and Fu, Huazhu and Zhen, Liangli and Xia, Yong},
        title = { { Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {672--682}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces Vision‑Amplified Semantic Entropy (VASE), an uncertainty‑based method for detecting hallucinations in medical VQA settings. The authors evaluate the proposed method on two VQA datasets with two MLLM backbones (CheXagent and LLaVA‑Med).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Evaluation on two large-scale and clinically relevant datasets (MIMIC‑Diff‑VQA; VQA‑RAD) shows AUC gains (up to 1.3% over SE) and higher AUG scores across both CheXagent and LLaVA‑Med backbones.

    2. VASE combines weak, clinically valid image augmentations with a contrastive mechanism to amplify the model’s sensitivity to visual evidence, which can preserve diagnostic features while overcoming language bias.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The main idea of this paper is based on “Detecting hallucinations in large language models using semantic entropy”, combining uncertainty estimation with visual perturbations or grounding. I don’t think the improvement is substantial enough for a MICCAI paper. In addition, the authors should compare it with other baselines such as VL‑Uncertainty ([30] in their paper).

    2. The work lacks a user study or expert-alignment evidence demonstrating that lower entropy truly corresponds to increased clinician trust or decision‑making accuracy in practice.

    3. The decision threshold τ is said to be “determined using the validation set” without a clear procedure or robustness analysis (e.g., sensitivity to τ).

    4. Experiments cover only chest X‑ray and generic radiology VQA; the method’s generalizability to other modalities (e.g., pathology slides, ultrasound) remains untested. Other VQA datasets, such as PathVQA, could be used for additional experiments.

    Minor: no analysis of runtime or resource cost is provided.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While VASE presents an adaptation of semantic entropy to the medical domain and demonstrates consistent empirical gains, its conceptual advance is incremental over prior uncertainty‑ and grounding‑based approaches, and the manuscript does not convincingly establish clinical impact or validate it with medical experts. This leads me to recommend rejection: the paper makes some empirical contributions, yet falls short of the originality and clinical-validation bar required for acceptance at MICCAI.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thanks for the authors’ reply. Some of my concerns (1, 3) are resolved, but the others remain major limitations. I therefore keep my recommendation on the reject side (raised from reject to weak reject).



Review #2

  • Please describe the contribution of the paper

    This paper proposes Vision-Amplified Semantic Entropy (VASE), which incorporates weak image transformations and increases the impact of the visual input relative to the text input to enhance hallucination detection in medical VQA.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper clearly categorizes previous works in hallucination detection into five types, including uncertainty estimation, cross-checking, external fact retrieval, etc., and clearly states the limitation to address.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Regarding the limitation of previous methods that strong image perturbations change clinical findings, this may not hold when applying color distortions to chest X-rays in the case shown in Fig. 1, where the detection of atelectasis can largely rely on spatial information, i.e., high intensities in the lower-left region. The authors should support their claim by clarifying what kinds of transformations were used in prior works.

    • The proposed approach to enhancing the impact of the visual input in MLLMs is not clearly described in the introduction section. It is difficult to connect the idea of using an additional distorted image with increasing the impact of the image on MLLMs. In my opinion, a straightforward way to increase the impact of vision is to reduce the number of hints (e.g., disease names) in the text, so that the model has to figure it out based largely on the image; if the model is then uncertain, it may be biased toward the text and not trustworthy. The authors should provide the high-level idea of the proposed approach in the introduction section.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well organized and clearly written, although the high-level idea of the proposed approach is difficult to grasp from the description.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    This rebuttal does not clearly justify the use of weak augmentation. The argument that strong augmentation distorts diagnostically nuanced appearance is heuristic and was not validated through model predictions or a reader study.

    The proposed visual contrasting, which uses weighted sum and subtraction, is incremental.



Review #3

  • Please describe the contribution of the paper

    The main contribution of the paper is that the authors, building on the semantic entropy work, propose a Vision-Amplified Semantic Entropy (VASE) method for medical VQA.

    1) It introduces contrastive semantic distributions between original and distorted images to suppress language biases and amplify visual evidence.
    2) It incorporates weak visual transformations to simulate distribution shifts and improve the robustness of uncertainty estimation.
    3) It achieves state-of-the-art hallucination detection performance on MIMIC-Diff-VQA and VQA-RAD datasets, outperforming previous approaches like SE.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper is well written with a clear structure. The motivation, method, and experiments are easy to follow.

    2) The idea of detecting hallucination by introducing slight perturbations to the image is interesting. Instead of only relying on response sampling (as previous methods like SE), the paper proposes to use visual distortions to amplify semantic uncertainty. This contrastive approach is simple but effective, and sheds light on how medical VQA models sometimes over-rely on language priors.

    3) The experiments are clean and well designed. The authors compare against strong baselines on two standard medical VQA datasets, and also perform detailed ablation studies to isolate the effects of each component (visual transformation and contrastive learning).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Lack of Threshold (τ) Details: One of the main weaknesses I found is the lack of details regarding the threshold (τ) used in the method. The paper doesn’t provide enough information on how the threshold is selected or what the specific values of τ are for the two datasets used in the experiments. It would be helpful to include more information on the process of selecting an appropriate threshold and the impact of different threshold values on performance. Additionally, it would be useful to provide more details about the threshold setting for the SE method and the other comparison methods in Table 1, as such a comparison could add clarity to the performance improvements claimed by the authors.

    2) Exploration of Threshold Sensitivity: It would be valuable for future work to include experiments that explore the sensitivity of the threshold (τ). Understanding how variations in the threshold affect the performance (e.g., AUC and AUG scores) could provide deeper insights into the robustness of the method and help optimize its real-world application.

    3) Limited Analysis of Visual Transformations: Another area for improvement is the limited analysis of visual transformations. The paper mentions using various transformations to amplify the influence of images in the pipeline, such as weak and strong visual transformations. However, the paper lacks a detailed analysis of how each specific transformation affects the results. It would be beneficial to include a more thorough exploration of how different types of image transformations (e.g., cropping, noise addition, rotation) impact the final performance, especially regarding hallucination detection. A breakdown of the impact of each transformation on the uncertainty estimation and hallucination detection accuracy would be valuable in guiding future research.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is overall interesting and valuable, addressing the important challenge of hallucination detection in medical VQA. However, some implementation details could be clarified, particularly the selection of the threshold (τ) and the other details noted above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns have been addressed.




Author Feedback

We thank all reviewers for their valuable time and feedback. The references appearing below are all from the paper.

R4&R5-Q1 Threshold τ:

We determine τ on the validation set using labeled hallucination and non-hallucination samples. By analyzing their VASE score distributions, τ can be selected based on desired trade-offs (e.g., high precision or recall). This is a common practice in uncertainty-based methods [6,17,22,29].
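
For concreteness, here is a minimal sketch of one plausible selection procedure on a labeled validation set. The criterion (a target precision) and all names are our illustrative assumptions; the rebuttal does not specify the exact rule.

```python
import numpy as np

def select_tau(val_scores: np.ndarray, val_labels: np.ndarray,
               target_precision: float = 0.9) -> float:
    """Return the smallest tau such that flagging samples with
    VASE score >= tau as hallucinations (val_labels == 1) reaches
    the target precision on a labeled validation set.
    This criterion is a hypothetical choice, not the authors' rule."""
    for tau in np.unique(val_scores):          # unique() returns sorted values
        flagged = val_scores >= tau            # at least one sample is flagged
        if val_labels[flagged].mean() >= target_precision:
            return float(tau)
    return float(val_scores.max())             # fallback: strictest cutoff
```

A recall-oriented criterion would be swapped in when missing a hallucination is costlier than a false alarm.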

Importantly, we respectfully clarify that, similar to SE [6], our evaluation metrics, AUC and AUG, are threshold-free. AUC measures the probability that a hallucinated sample is ranked above a non-hallucinated one by uncertainty score, reflecting ranking quality without requiring a fixed cutoff. AUG aggregates the model’s correctness (mean GREEN score) for the top X% most-confident samples across all confidence percentiles, where X ranges from 1 to 100 in increments of 1. Using threshold-free metrics ensures fairer comparisons for VASE and baselines, as they don’t rely on specific τ values.
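
Both metrics are straightforward to reproduce from this description. The sketch below reflects our reading of the rebuttal (a pairwise-ranking AUC with hallucinations as positives, and AUG as the mean GREEN score of the top-X% most-confident answers averaged over X = 1..100); it is not code from the authors' repository.

```python
import numpy as np

def ranking_auc(uncertainty: np.ndarray, is_hallucination: np.ndarray) -> float:
    """Probability that a hallucinated sample receives a higher
    uncertainty score than a non-hallucinated one (ties count 0.5)."""
    pos = uncertainty[is_hallucination.astype(bool)]
    neg = uncertainty[~is_hallucination.astype(bool)]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

def aug_score(uncertainty: np.ndarray, green: np.ndarray) -> float:
    """Mean GREEN score of the top-X% most confident
    (lowest-uncertainty) answers, averaged over X = 1..100."""
    green_sorted = green[np.argsort(uncertainty)]  # most confident first
    n = len(green_sorted)
    tops = [green_sorted[:max(1, round(n * x / 100))].mean()
            for x in range(1, 101)]
    return float(np.mean(tops))
```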

R3-Q1 Effect of Strong Perturbations:

Prior work like VL-Uncertainty [30] uses strong perturbations (e.g., Gaussian blur with radii up to 1.4). Such blurs can distort fine-grained details critical for diagnosing conditions like pneumothorax or lung opacity, justifying our focus on weak, clinically valid transformations combined with visual contrasting.
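
To make the weak/strong contrast concrete, the snippet below juxtaposes a strong blur of the kind attributed to VL-Uncertainty with weaker alternatives. The weak transformations shown are illustrative guesses; this page does not enumerate the paper's exact set.

```python
from PIL import Image, ImageFilter

def strong_transform(img: Image.Image) -> Image.Image:
    """Strong perturbation of the kind the rebuttal attributes to
    VL-Uncertainty: a Gaussian blur with radius up to 1.4, which can
    smear fine-grained findings such as a pneumothorax margin."""
    return img.filter(ImageFilter.GaussianBlur(radius=1.4))

def weak_transform(img: Image.Image) -> Image.Image:
    """Weaker, clinically more conservative perturbations
    (illustrative choices, not the paper's exact set): a faint blur
    plus a 1-degree rotation that leaves anatomy legible."""
    out = img.filter(ImageFilter.GaussianBlur(radius=0.3))
    return out.rotate(1.0, resample=Image.BILINEAR)
```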

R3-Q2 Clarification of Visual Contrasting:

We compute two predictive semantic distributions: P (from the original image-text input) and P′ (from the distorted image-text input). The difference between P and P′ reflects the model’s sensitivity to visual input. Our visual contrasting mechanism (weighted subtraction: (1+α)P − αP′) amplifies responses that change significantly with visual alteration (i.e., are more visually grounded) and suppresses those primarily driven by language priors (which would be similar in P and P′). This enhances the influence of the visual input in the final entropy estimation. We will significantly clarify this high-level concept in the Introduction.
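
In code, the mechanism amounts to a weighted subtraction followed by an entropy, as sketched below. The clipping/renormalization step and the value of α are our assumptions to keep the entropy well defined, and may differ from the authors' implementation.

```python
import numpy as np

def vase_score(p: np.ndarray, p_dist: np.ndarray, alpha: float = 0.5) -> float:
    """Visual contrasting as described above: q = (1 + alpha) * P - alpha * P',
    followed by the entropy of q. `p` and `p_dist` are semantic-cluster
    distributions from the original and distorted image-text inputs,
    aligned over the same clusters; alpha = 0.5 is an illustrative value."""
    q = (1.0 + alpha) * p - alpha * p_dist
    # The subtraction can leave small negative mass; clip and
    # renormalize so the entropy is well defined (our choice).
    q = np.clip(q, 1e-12, None)
    q = q / q.sum()
    return float(-(q * np.log(q)).sum())
```

Answers driven mainly by language priors appear with similar mass in P and P′, so the subtraction shrinks their contribution, while visually grounded answers shift between the two distributions and are amplified.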

Regarding reducing textual hints, medical VQA questions are already concise (e.g., “What abnormalities are seen in this image?”), and removing tokens may hurt clarity. Prior work shows MLLMs rely more on text than image tokens, motivating our explicit visual amplification.

R4-Q1 Major Contribution:

Our contribution is not just combining SE with visual perturbations, but critically rebalancing visual and textual modalities for entropy estimation in medical VQA. VASE’s core novelty is Visual Contrasting, specifically designed to counteract MLLMs’ over-reliance on language priors—a key limitation unaddressed by methods like VL-Uncertainty. The performance of VL-Uncertainty is reported in Table 2 (row 2), where VASE outperforms it by 2.09% in AUC and 2.56% in AUG.

R4-Q2 Linking Entropy and Accuracy:

Fig. 2 shows that lower entropy correlates with higher GREEN scores (accuracy). Based on a user study in [21], GREEN shows the strongest alignment with radiologist assessments among several metrics. This supports using entropy filtering to improve accuracy, consistent with prior uncertainty-based methods [6, 17, 29]. While a user study on trust is valuable future work, the current results establish VASE’s utility in identifying more accurate answers.

R4-Q3 Generalization to More Modalities:

Results on two datasets and two MLLMs already demonstrate VASE’s effectiveness. Extending to other modalities such as PathVQA requires hallucination labels, but GREEN—the model used for labeling—is trained on radiology and cannot generalize to other domains. This limits additional experiments during the rebuttal period, but we plan to address this in future work.

R5-Q1 Analysis of Visual Transformations:

We appreciate the suggestion. We focused on demonstrating the overall effectiveness of visual transformations and visual contrasting due to space limits, and plan to include an analysis of each transformation type in the journal version.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This submission received mixed reviews, with one reviewer recommending acceptance and two recommending rejection, despite some initial interest. Reviewers raised concerns regarding the incremental nature of the proposed contrastive formulation, the lack of a sensitivity analysis for the decision threshold, and the limited exploration of the visual perturbations used to amplify semantic entropy.

    Despite these limitations, I believe the paper offers added value to the MICCAI community. The proposed method builds upon the semantic entropy framework in a manner well-adapted to the domain-specific challenges of medical VQA, where hallucinations can have high clinical cost. While conceptually simple, the method offers a compelling shift in emphasis—from textual to visual grounding—by introducing contrastive signals derived from weak image transformations.

    The empirical evaluation is carefully executed and includes comparisons to relevant baselines and meaningful ablations. The gains reported, though moderate, are consistent and observed across two VLM backbones.

    That said, the authors should relax or clarify the assumption that weak perturbations alone suffice to demonstrate language suppression. For example, if the image encoder is robust to such perturbations (as it ideally should be), then contrastive semantic entropy might yield weak effects—without directly proving that the model is driven by language priors over visual content. This limitation does not invalidate the method but should be acknowledged more explicitly in the revised manuscript.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces a visual semantic entropy for detecting hallucinations in medical VLMs. To keep image distortions weak while amplifying visually relevant differences, the authors propose a vision-contrasting mechanism. The technical contributions are clearly articulated. Experiments demonstrate improvements over existing hallucination detection baselines on two medical VQA datasets and across two VLM backbones. To further strengthen the work, the authors could add statistical analyses of the results, such as p-values and confidence intervals, and investigate the impact of visual transformations as suggested.


