Abstract

Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40% across three Medical VQA datasets, significantly enhancing MLLMs’ reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.
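For readers unfamiliar with the metric, ECE here can be read as the standard equal-width binned definition: the gap between mean confidence and empirical accuracy, averaged over bins and weighted by bin occupancy. A minimal sketch (function name and toy numbers are illustrative, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: occupancy-weighted mean |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # left-open bins
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Toy case: ten answers at 0.9 confidence, 9 of them correct -> well calibrated.
conf = np.array([0.9] * 10)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
print(round(expected_calibration_error(conf, corr), 4))  # -> 0.0
```

A lower ECE means verbalized confidence tracks accuracy more closely, which is what the reported 40% average reduction refers to.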

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1840_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{DuYue_Confidence_MICCAI2025,
        author = {Du, Yuetian and Wang, Yucheng and Kong, Ming and Liang, Tian and Long, Qiang and Chen, Bingdi and Zhu, Qiang},
        title = {{Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {89--98}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the important yet under-explored problem of confidence calibration for multimodal large language models (MLLMs) in high-risk medical VQA tasks. The authors propose a novel calibration framework that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment. The method is evaluated across three medical VQA datasets and three MLLM backbones, demonstrating its general applicability and effectiveness.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper tackles a practically valuable and safety-critical issue—calibration of MLLM predictions in medical VQA, which is essential for real-world deployment in high-risk clinical scenarios.
    2. The proposed approach is evaluated on three datasets and three different MLLMs, showing consistent improvements, which supports the method’s generalizability.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The selection of baselines is relatively weak. For example, from reference [6], only the simplest “Vanilla” variant is used, whereas [6] proposes a more comprehensive framework involving prompting, sampling, and aggregation (e.g., “Top-K prompt + Self-Random sampling + Avg-Conf or Pair-Rank aggregation”). Similarly, from reference [7], only the “Punish” method is compared, while [7] introduces several stronger variants, including “Challenge”, “Explain”, and their combinations. The lack of comparison with stronger baselines limits the strength of the claimed improvements.
    2. The ablation study is insufficient. The proposed auxiliary expert LLM assessment component, which plays a key role in reassessing confidence, is not thoroughly evaluated via ablation, making it difficult to isolate its contribution.
    3. The evaluation focuses solely on calibration metrics such as ECE and AUC (for confidence discrimination). However, since the core task is VQA, it would be beneficial to also report the models’ answer accuracy to understand the overall utility and safety of the proposed approach.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a meaningful contribution by addressing confidence calibration in MLLMs for medical VQA, a task of significant real-world relevance. The method is novel and performs well across multiple datasets and models. However, the evaluation setup has notable limitations: weak baseline comparisons and missing ablations for key components. These issues limit confidence in the paper’s core claims.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After carefully reviewing the authors’ rebuttal, I find that they have adequately addressed my primary concerns. Given these clarifications, I believe the paper is now suitable for acceptance.



Review #2

  • Please describe the contribution of the paper

    This paper explores confidence calibration for multimodal LLMs in medical VQA settings. It proposes a two-phase interrogation strategy combined with an auxiliary expert LLM to better align predicted confidence with accuracy. The method improves calibration metrics across multiple datasets and models, and offers an interpretable framework based on prompting rather than model fine-tuning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    · The paper addresses an underexplored but important issue—confidence calibration for medical multimodal LLMs—where errors can have high-stakes consequences.
    · The proposed MS-FBI framework is a creative use of prompt-based multi-step interaction, requiring no model modification, and is adaptable to different MLLMs.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    · The method relies heavily on prompt engineering and an external LLM as a “calibration expert,” but does not justify whether this expert model is reliably more accurate or calibrated than the original MLLM.
    · Only three MLLMs are evaluated, with no comparison to recent strong baselines in VQA or other calibration methods beyond verbal confidence prompts.
    · While the paper focuses on calibration metrics, it does not analyze whether better calibration actually improves downstream clinical decision-making or diagnostic quality.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is interesting and the results are encouraging, but the method relies heavily on prompting without deeper model design, and the evaluation setup could be more comprehensive.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The primary contribution of the paper is the comprehensive investigation and development of a novel confidence calibration method for multimodal large language models (MLLMs) in the high-stakes medical Visual Question Answering (VQA) domain.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel Formulation and Methodology: The combination of the MS-FBI interrogation system with an auxiliary expert LLM assessment is an innovative approach. Unlike traditional single-phase calibration methods, this dual-phase strategy allows for both an initial inspection and a deeper, reflective probing of the model’s reasoning process. The use of punishment, challenge, and explanation prompts in a systematic framework is novel. It draws inspiration from cognitive restructuring principles and lie detection to elicit more calibrated and reliable confidence scores.

    Effective Use of Data: The paper performs a comprehensive empirical study on multiple publicly available Medical VQA datasets, ensuring robust evaluation across various clinical scenarios. The modular design of the framework (with interchangeable strategy components) allows for flexible adaptation to different models and datasets, which is a strong aspect given the diversity of medical imaging and question-answering tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the multi-phase interrogation framework is innovative, its reliance on specifically designed prompt templates (punishment, challenge, and explanation) may pose reproducibility issues. The method might require extensive tuning for different setups, which could limit its general applicability. Similar prompt-based calibration techniques have been explored in general LLM calibration literature (e.g., using chain-of-thought or reinforcement learning adaptations). The reliance on verbalized confidence and specific punishment cues might not be entirely novel and could be seen as an incremental improvement rather than a fundamentally new direction.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The related work on Medical VQA is quite limited; please refer to “VQAMix: Conditional Triplet Mixup for Medical Visual Question Answering,” which first handled confidence calibration in the MedVQA task.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The omission of reasoning-based VLLMs prevents the paper from reaching a higher score.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I have no further concerns.




Author Feedback

Thank you for the detailed and constructive feedback! We appreciate the opportunity to address your concerns and improve our work, and we respond to the reviewers’ concerns point by point below.

Reviewer 1 Q1: Extended Baseline Experiments
To evaluate our method more comprehensively, we added comparative experiments with multiple strategies from [1] on the VQA-RAD dataset (using LLaVA-NeXT-7B):

Method            ECE (↓)   AUC (↑)
Challenge          41.05     50.98
Explain            36.18     54.33
Punish+Explain     39.52     55.10
MS-FBI (Ours)      20.90     56.88

These experiments demonstrate that our method significantly outperforms the existing strategies. Note that the approach in [2] relies on multiple sampling and answer aggregation, whereas our baselines use single sampling, so a direct comparison could be biased. Future work will integrate such ideas to further enhance the framework’s robustness.

Q2: Ablation Study Supplement
As suggested, we added an ablation study using only the Punish strategy (averaged across the three datasets):

Model               ECE (↓)   AUC (↑)
LLaVA-1.5-med-7B     23.56     49.98
LLaVA-NeXT-7B        19.07     52.93
Molmo-7B             15.51     50.37

The results show that while the Punish strategy alone improves some ECE scores, the AUC gains remain limited, indicating that single-round interrogation is insufficient for thorough calibration. Additionally, comparing rows 1, 2, and 4 of Table 2 (with expert models) against rows 1-3 of the table in R1Q1 (without expert models) clearly validates the critical role of expert models in the MS-FBI system.

Q3: Model Performance Validation
In practical applications, a simple strategy can translate calibration quality (confidence aligned with accuracy) into performance gains. Following the confidence-threshold method from [3] (with a threshold of ≥80%), we directly accepted high-confidence answers and re-sampled low-confidence ones. Accuracy before and after calibration:

Condition             Accuracy (%)
Before calibration       60.59
After calibration        64.21

These results confirm that MS-FBI effectively enhances the diagnostic reliability of medical MLLMs.
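The accept/re-sample loop described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation; `sample_answer` is a hypothetical callable returning an (answer, confidence) pair per model query:

```python
def threshold_accept(sample_answer, threshold=0.80, max_retries=3):
    """Accept an answer once its verbalized confidence clears the threshold;
    otherwise re-sample, up to max_retries additional queries."""
    answer, conf = sample_answer()
    for _ in range(max_retries):
        if conf >= threshold:
            break  # confident enough: keep this answer
        answer, conf = sample_answer()  # low confidence: query again
    return answer, conf
```

Well-calibrated confidences make the ≥80% gate meaningful: accepted answers are genuinely more likely to be correct, and the budget for re-sampling is spent only where the model is uncertain.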

Reviewer 2 Q1: Generalization Validation
We further tested the generalizability of MS-FBI on the recently released medical reasoning model MedVLM-R1 [4]:

Method    ECE (↓)   AUC (↑)
Vanilla    30.46     44.81
Punish     33.44     42.38
Top-k      18.41     52.58
MS-FBI     14.06     55.62

The results demonstrate strong cross-model adaptability.

Reviewer 3 Q1: Expert Model vs. Original Model
We acknowledge that the discriminative capability of the expert model itself is important. Note, however, that the expert model in this paper (an LLM) is used for contextual analysis and decision-making over the interrogation results; it does not itself possess cross-modal understanding capabilities. The confidence calibration for VQA therefore does not rely entirely on the expert model’s ability. A more powerful expert model may further improve calibration, but that is not the core contribution of this paper.

Q2: Please refer to the results in R1Q1 (Extended Baseline Experiments) and R2Q1 (Generalization Validation).

Q3: Please refer to the results in R1Q3 (Model Performance Validation).

[1] When Do LLMs Need Retrieval Augmentation? Mitigating LLMs’ Overconfidence Helps Retrieval Augmentation
[2] Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
[3] Optimizing GPT-4 Turbo Diagnostic Accuracy in Neuroradiology through Prompt Engineering and Confidence Thresholds
[4] MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The paper presents a novel multi-phase framework for improving confidence calibration in medical visual question answering (VQA), addressing an important and underexplored challenge. Reviewers commend the paper’s motivation and practical relevance, particularly its effort to enhance interpretability and trustworthiness in clinical AI systems. However, they also point out concerns such as the weak selection of baselines, reliance on prompt engineering without sufficient justification, and a somewhat limited discussion of related work in medical VQA. These issues impact the clarity and completeness of the contribution. I recommend inviting the authors to submit a rebuttal to address these methodological and framing concerns.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While the core ideas may build on earlier prompting-based calibration strategies, their integration in a medical VQA context with the MS-FBI dual-stage framework is novel and impactful. The rebuttal meaningfully strengthens the paper and alleviates major concerns. Overall, this is a valuable empirical contribution.


