Abstract
Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning for medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracy of their predictions and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets (PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification) demonstrate CALIN's effectiveness at ensuring fair confidence calibration in its predictions, while improving overall prediction accuracy and exhibiting a minimal fairness-utility trade-off. The codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.
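The following is a minimal sketch of the bi-level idea described in the abstract, assuming a content-free (null-input) probing estimator in the spirit of contextual calibration (Zhao et al., ICML 2021, cited as [1] in the rebuttal below). All names here are hypothetical; the exact estimation and combination rules are those defined in the paper and the linked repository.

import numpy as np

# Hedged sketch of bi-level null-input calibration; function names are
# illustrative, and the exact combination rule is the one in the paper.

def null_input_distribution(query_fn, null_inputs):
    # Average predicted class distribution over content-free (null) inputs.
    probs = np.stack([query_fn(x) for x in null_inputs])  # (n_null, n_classes)
    return probs.mean(axis=0)

def calibration_matrix(p_null):
    # Diagonal W such that W @ p_null is proportional to uniform.
    return np.diag(1.0 / np.clip(p_null, 1e-8, None))

def calibrate(p, W):
    # Apply a calibration matrix, then renormalize to a valid distribution.
    q = W @ p
    return q / q.sum()

# Level 1 (population): one matrix estimated from null inputs with no
# demographic cue. Level 2 (subgroup): one matrix per subgroup, from null
# inputs whose prompts carry that subgroup's attribute. At inference, a test
# sample's predicted confidences are calibrated using the combined
# population- and subgroup-level estimates before the prediction is read off.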
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4786_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/xingbpshen/medical-calibration-fairness-mllm
Link to the Dataset(s)
N/A
BibTex
@InProceedings{SheXin_Exposing_MICCAI2025,
author = { Shen, Xing and Szeto, Justin and Li, Mingyang and Huang, Hengguan and Arbel, Tal},
title = { { Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
(1) The proposed method addresses bias in LLM-based diagnosis; (2) it not only reduces bias across subgroups but also improves diagnostic accuracy.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Addressing bias in LLM-based diagnosis across subgroups is important in clinical practice.
- Previous studies have primarily focused on evaluating whether bias exists in LLM-based diagnosis, while this paper proposes a method to actively address the issue.
- The proposed method is evaluated on three datasets to demonstrate its generalizability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The method is evaluated using GPT-4o-mini; however, private clinical data typically cannot be used as input for such models. Therefore, the effectiveness and applicability of the proposed method in real-world clinical practice remain uncertain.
- According to the paper, the work is not easily reproducible. Although the authors claim they will release the source code and/or dataset upon acceptance, this limits the current reliability and transparency of the study.
- The baselines used for comparison are outdated and relatively weak, which may limit the strength of the performance claims.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The authors should first apply prompt-based few-shot or zero-shot learning directly on the test dataset to confirm the existence of bias in LLM-based diagnosis and to establish a baseline for demonstrating the effectiveness of the proposed method.
- The proposed method should be evaluated on a broader range of LLMs to better demonstrate its generalizability.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Based on the comments above.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
- I would like to clarify that I mistakenly selected the wrong response for Question 9 regarding reproducibility.
- I had intended to choose the final option: “The submission does not provide sufficient information for reproducibility.”
- Rather than: “The authors claimed to release the source code and/or dataset upon acceptance of the submission.”
- The paper only includes a prompt list (Table 1), and in my view, that is not sufficient to ensure reproducibility, especially for a large language model (LLM)-based approach. The authors did not state that they would release code during the rebuttal stage either. For an LLM pipeline, reproducing results typically requires access to specific prompt engineering logic, model versions, inference pipelines, and post-processing steps, which are not clearly provided in the submission.
- While the authors may have access to the MIMIC-CXR dataset through a signed Data Use Agreement (DUA), it is important to note that uploading any part of this dataset to a third-party LLM like GPT is not permitted unless the author’s institution has signed a Business Associate Agreement (BAA) with the service provider (e.g., OpenAI). Even though MIMIC-CXR is available upon approval, it is not fully public and contains sensitive patient information, which is protected under HIPAA.
Without a BAA in place, using GPT APIs to process any part of MIMIC-CXR (e.g., image captions, clinical summaries, or demographic information) could constitute a violation of the data use policy. The authors have not clarified whether such a BAA exists. To ensure compliance and ethical data handling, I believe the authors should either:
- Confirm that their institution has a BAA with OpenAI (or equivalent), or
- Demonstrate their approach using open-source LLMs deployed locally (e.g., LLaMA, Mistral), which do not transmit data externally.
- This is particularly important given the clinical nature of the data and the increasing scrutiny over privacy in AI-based medical research.
- While the authors aim to reduce bias in LLM-based clinical tasks, the current submission does not compare against existing fairness-aware medical imaging methods. Several deep learning approaches already integrate imaging and demographic data to mitigate subgroup bias. Although the authors highlight fairness in LLMs as their motivation, the current method does not involve any LLM reasoning capabilities. In this case, it is unclear whether their method offers an advantage over established bias-mitigation techniques in medical image analysis.
Review #2
- Please describe the contribution of the paper
This paper introduces CALIN, a post-hoc calibration method that does not require any dedicated training or validation set. The framework exploits the so-called multimodal null-input probing technique to calibrate the model without additional data, aligning the predicted confidences of meaningless samples with a uniform distribution, at both the population level and the subgroup level. The final calibration is obtained by combining these two individual ones. The evaluation considers a variety of metrics and several multimodal medical datasets, showing consistently improved results.
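As a point of reference, aligning null-input confidences with a uniform distribution can be formalized as in contextual calibration (Zhao et al. [1], cited in the rebuttal below); CALIN's exact estimator may differ:

\[
\bar{\mathbf{p}} = \frac{1}{N}\sum_{i=1}^{N} f\!\left(\mathbf{x}^{\mathrm{null}}_{i}\right),
\qquad
\mathbf{W} = \operatorname{diag}(\bar{\mathbf{p}})^{-1},
\qquad
\hat{\mathbf{p}}(\mathbf{x}) \propto \mathbf{W}\, f(\mathbf{x}),
\]

where $f(\cdot)$ denotes the model's predicted class distribution, so that $\mathbf{W}\bar{\mathbf{p}} \propto \mathbf{1}$, i.e., null inputs are mapped to the uniform distribution after calibration.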
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is clearly written and allows the reader to understand the methodologies employed from the LLM literature. It focuses on an important problem, i.e. not only miscalibration itself, but imbalanced miscalibration across subgroups. Additionally, the absence of the need for a calibration set can be useful, as the target datasets might be smaller in-house datasets, where having a dedicated calibration set is not feasible. The evaluation is carried out across a variety of metrics and on various multi-modal medical datasets, showing consistently improved results.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
I believe it would be beneficial to also give the definition of the EOR metric, in the same way it is done for the calibration metrics, and to briefly describe the FS-ICL baseline considered.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is clearly written and tackles a relevant problem, which is currently under-explored in the literature. It proposes to use methods from the LLM literature to mitigate the calibration gap across subgroups and carries out an extensive evaluation.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I thank the authors for the rebuttal and addressing my comments. I confirm my positive feedback about this paper and I vote for acceptance.
Review #3
- Please describe the contribution of the paper
This paper investigates the calibration and demographic fairness of large multimodal models (LMMs) in few-shot in-context learning (FS-ICL) for medical image classification. The authors identify significant calibration biases across demographic groups and propose CALIN, an inference-time, training-free calibration method. CALIN operates via a bi-level procedure, estimating population-level and subgroup-level calibration adjustments using null-input probing techniques. The approach is evaluated on three medical imaging datasets demonstrating improvements in confidence calibration, subgroup fairness, and prediction accuracy. To the best of the authors’ knowledge, this is the first work to systematically study calibration fairness in FS-ICL with LMMs in medical imaging.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well-structured and well-written, with the motivation clearly presented and matched by the method and experiments, which seem sufficient to establish the interest in the method.
- The paper focuses on the calibration and fairness of large multimodal models (LMMs) under few-shot in-context learning (FS-ICL), an area of increasing relevance as foundation models are considered for deployment in sensitive domains such as medical imaging.
- The proposed approach, CALIN, introduces a two-level calibration adjustment based on null-input probing. To the best of my understanding, this bi-level formulation for calibration, particularly in a black-box setting, appears to be a novel and practical contribution.
- CALIN operates entirely at inference time and does not require access to model internals or additional training data. This seems relevant as large models are often either black boxes or used as such, and training or fine-tuning can be prohibitively difficult.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper assumes a standard scoring mechanism for candidate labels via prompting, but does not specify how multi-token labels are handled or whether any normalization is applied. Since this step is central to computing calibrated probabilities, more detail would be helpful. Could these aspects influence the resulting confidence estimates and, in turn, affect the observed calibration and fairness outcomes? On the same line, could one suspect that the choice of labels (e.g., yes/no vs. positive/negative) is linked to the effectiveness of calibration? Possibly, another open question could be whether complex label and attribute spaces, e.g. including very technical terms or long names of rare pathologies, could have an adverse effect on the efficacy of the calibration procedure.
- Limited positioning within the few-shot learning literature. The paper could do more to situate itself relative to prior work on few-shot prompting and calibration with LLMs. As a non-expert, I found it difficult to judge what is new versus what builds on established methods.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Recommendation: accept.
The paper addresses a relevant problem with a practical, clearly presented method. While some assumptions around prompting and label structure are left implicit, the empirical results are convincing, and the approach appears useful in realistic settings. I find the contribution valuable and recommend acceptance.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
Rebuttal:
We thank all reviewers for their careful evaluation of our submission. We are encouraged that all three reviewers acknowledged our paper as: (i) posing an important problem in the field that has not yet been addressed, (ii) introducing a novel and practical method with strong empirical evaluation, and (iii) presenting the work in a clear and well-written manner. We address the misunderstandings and questions below.
Reviewer 1:
[Misunderstanding: Lack of reproducibility claim] The reviewer states that the paper itself claims the work is not easily reproducible, thereby limiting its reliability and transparency. We did not make this claim in the paper. On the contrary, we provided exact prompts in the paper (Table 1), and we have prepared a well-documented codebase (publicly available upon acceptance of the paper) to facilitate reproducibility and ease of use of the method.
[Baselines] The reviewer commented that the baselines are outdated and weak. However, as acknowledged by all reviewers, this is the first paper to address this problem, and no established baselines currently exist. Moreover, the reviewer did not cite any alternative baselines. We compared our method against the standard approach (Vanilla baseline) and against the population-level approach (L1) in our Ablation (Sec. 4.2). The latter also serves as a strong baseline, as it can be interpreted as an extension of [1] tailored to our problem setting.
[FS Baseline] The reviewer suggested that the paper should first apply a FS-ICL method to confirm the existence of bias and to serve as a baseline. This is precisely what we did in our experiments: the Vanilla method (Table 2) corresponds to the standard FS-ICL method and serves to confirm the presence of biases.
[GPT Privacy] We appreciate the reviewer’s concern about privacy. Our use of GPT models aligns with prior FS-ICL work in medical image classification [2] (we cited [2] in Sec. 1). While privacy is important, our focus is on safe, responsible use in clinical contexts. Privacy risks are beyond the current scope. Emerging solutions (e.g., OpenAI Enterprise API) are being developed to support these concerns for future adoption.
[Evaluation on More LMMs] As this is the first paper to address this open problem, we prioritized conducting extensive experiments across multiple datasets and detailed ablation studies to establish a solid foundation. We agree that evaluating a broader range of LMMs is an important direction, and we consider this a valuable avenue for future work.
Reviewer 2:
[Metric] The reviewer asked us to describe EOR and the FS-ICL baseline. EOR is a metric for evaluating disparities in true/false positive rates across subgroups (see citation in Sec. 4.1). The Vanilla method is the standard FS-ICL approach without the proposed calibration algorithm, and serves as a direct point of comparison to assess the impact of our method.
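For readers unfamiliar with the metric, here is a minimal sketch, assuming EOR denotes the equalized odds ratio in its common form (the worse of the across-subgroup TPR and FPR ratios); the paper's exact definition is the one given by the citation in Sec. 4.1.

import numpy as np

# Hedged sketch of an equalized odds ratio; a value of 1.0 means the
# true/false positive rates are perfectly equalized across subgroups.

def rate(y_true, y_pred, condition):
    # Mean prediction over samples whose label satisfies the condition:
    # condition(y) == (y == 1) gives TPR; condition(y) == (y == 0) gives FPR.
    mask = condition(y_true)
    return y_pred[mask].mean() if mask.any() else np.nan

def equalized_odds_ratio(y_true, y_pred, groups):
    tprs, fprs = [], []
    for g in np.unique(groups):
        sel = groups == g
        tprs.append(rate(y_true[sel], y_pred[sel], lambda y: y == 1))
        fprs.append(rate(y_true[sel], y_pred[sel], lambda y: y == 0))
    tpr_ratio = np.nanmin(tprs) / np.nanmax(tprs)
    fpr_ratio = np.nanmin(fprs) / np.nanmax(fprs)
    return min(tpr_ratio, fpr_ratio)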
Reviewer 3:
[Multi-Token Labels] The reviewer asked how to handle multi-token labels. One approach is to map each label to a one-character option using a multiple-choice format, and normalize as in Eq. 3. For example, “pleural effusion” maps to option “A”, and the LMM is then tasked with predicting the correct option.
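A minimal sketch of this multiple-choice mapping follows, with hypothetical names: query_option_probs stands for whatever routine elicits a per-option probability from the LMM, and the final renormalization mirrors the Eq. 3 style normalization mentioned above.

LABELS = ["pleural effusion", "no finding"]  # multi-token candidate labels
OPTIONS = {chr(ord("A") + i): label for i, label in enumerate(LABELS)}
# -> {"A": "pleural effusion", "B": "no finding"}

def normalized_label_probs(query_option_probs, prompt):
    # Score each one-character option, then renormalize over the candidates
    # so the resulting confidences form a valid probability distribution.
    raw = {opt: query_option_probs(prompt, opt) for opt in OPTIONS}
    total = sum(raw.values())
    return {OPTIONS[opt]: p / total for opt, p in raw.items()}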
[More References] References to prior work on the calibration of language models for natural language processing [3–5] can be included.
References:
[1] Zhao et al. Calibrate before use: Improving few-shot performance of language models. ICML, 2021.
[2] Ferber et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat. Commun., 2024.
[3] Li et al. Few-shot recalibration of language models. arXiv:2403.18286, 2024.
[4] Xiong et al. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. ICLR, 2024.
[5] Xiao et al. Uncertainty quantification with pre-trained language models: A large-scale empirical analysis. EMNLP, 2022.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This submission received clear acceptance (5) recommendations from two reviewers; the main concern from R1 is the reproducibility of this work. The authors have stated in the rebuttal that the code will be made publicly available upon acceptance of the paper. Considering that R1's request is not practical (code files / external links are not allowed during rebuttal) and that there are no other major concerns with this paper, my recommendation is accept.