Abstract
Vision language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. Our analysis reveals that ICL influences model predictions through multiple mechanisms: (1) ICL allows VLMs to learn subgroup-specific disease base rates from prompts and (2) ICL leads VLMs to make predictions that perform differently across demographic groups, even after controlling for subgroup-specific disease base rates. Our empirical results inform best practices for prompting current VLMs (specifically, examining demographic subgroup performance, and matching label base rates to the target distribution both overall and within subgroups), while also suggesting next steps for improving our theoretical understanding of these models.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3276_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/DaneshjouLab/BiasICL
Link to the Dataset(s)
https://ddi-dataset.github.io/
https://aimi.stanford.edu/datasets/chexpert-chest-x-rays
BibTex
@InProceedings{XuSon_BiasICL_MICCAI2025,
author = { Xu, Sonnet and Janizek, Joseph D. and Jiang, Yixing and Daneshjou, Roxana},
title = { { BiasICL: In-Context Learning and Demographic Biases of Vision Language Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {89 -- 99}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents an empirical study evaluating the fairness of 3 commercial Vision-Language Models (VLMs) in two medical imaging tasks (malignant skin lesion detection and pneumothorax detection) across demographic subgroups, with a focus on In-Context Learning (ICL). The study provides three key insights: (1) Majority Label Bias: VLMs exhibit a bias toward the most frequent label in the few-shot examples provided in the prompt. (2) Group Disease Prevalence Impact: Performance disparities arise when there are differences in disease prevalence between demographic groups in the few-shot examples. (3) Trade-offs in Performance Scaling: Increasing the number of in-context examples can improve performance for one subgroup while degrading it for another.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Strengths of the Paper:
(1) While prior work has primarily examined fairness in VLMs under zero-shot settings, this paper provides one of the first empirical investigations into fairness in ICL. The study evaluates three state-of-the-art VLMs across two distinct medical imaging tasks (malignant skin lesions and pneumothorax detection), ensuring robustness and generalizability.
(2) The paper advances our understanding of VLM biases by (a) demonstrating how differences in disease prevalence between demographic groups in few-shot examples can lead to performance disparities, and showing the trade-off between subgroup performances when scaling example counts, and (b) validating prior observations (e.g., majority-label bias in ICL).
(3) The use of public datasets and the release of an open-source repository enhance the reproducibility of the work, facilitating future research in this line.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper could be improved by addressing the following limitations:
(1) The datasets (especially DDI) seem too small to support robust, generalizable conclusions. For example (assuming a stratified split) the number of positive samples in the test set per group is ~12 for this dataset. In the case of CheXpert, a subset of 500 samples is used. It is important to clarify how this subset was selected, as other biases could influence the outcomes.
(2) The authors acknowledge that minor prompt variations can significantly affect results. Given this sensitivity, it is worth questioning whether the chosen prompt structure (following Jiang et al.) might have systematically influenced model outputs. How resilient are the conclusions to changes in prompt phrasing? For example, Claude’s system instruction explicitly asks the model to ensure demographic fairness. Could this be the reason why it abstains a lot? A brief ablation or discussion around alternative prompt formats would strengthen the robustness of the findings.
(3) Clarity in Methodological Notation (Section 2.4): The exposition in Section 2.4 would benefit from clearer definitions of the variables f, g, x, and y, along with a discussion of their potential values. For example, what does g(x) = 0 represent in the context of each dataset? A concise table or illustrative example might aid comprehension for readers unfamiliar with this notation.
(4) The use of only three independent runs to assess variability is insufficient for drawing statistically sound conclusions. Increasing the number of repetitions would allow for more reliable confidence intervals and better insight into variability across runs.
(5) The authors could strengthen their quantitative analysis by incorporating appropriate statistical tests to examine the relationship between the variables they alter in each experiment and model outcomes. For instance, applying Kendall’s Tau for testing monotonic increase could help check whether increasing the number of positive examples consistently leads to more positive predictions.
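The monotonicity check suggested above can be made concrete with a minimal sketch of Kendall's tau (the tau-a variant, without tie correction). The demonstration counts and prediction rates below are illustrative placeholders, not results from the paper; in practice one would use scipy.stats.kendalltau, which also supplies a p-value.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: number of positive demonstrations in the prompt vs. the
# fraction of test samples the VLM labels positive at each setting.
n_positive_demos = [0, 2, 4, 6, 8, 10]
positive_pred_rate = [0.10, 0.18, 0.25, 0.31, 0.40, 0.52]

tau = kendall_tau(n_positive_demos, positive_pred_rate)
print(f"Kendall's tau = {tau:.2f}")  # 1.00 for a perfectly monotone relationship
```

A tau near +1 would support the hypothesis that adding positive examples consistently increases the model's positive-prediction rate.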
Other minor comments:
(a) In Figure 1 (b–d), including a y-axis indicating the exact number of examples incorporated into the prompt would greatly improve the clarity. (b) It is interesting to observe that while GPT appears more accurate at predicting sex from x-rays, its performance on pneumothorax prediction exhibits less variability compared to other models. Do the authors have a hypothesis for this phenomenon?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
For an extension, the authors could make use of the datasets in the FairMedFM benchmark (Jin, Yu, Zhong et al., NeurIPS 2024).
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper introduces new insights that can improve our understanding of bias in VLMs when using ICL, I find that some factors could have led to some of the conclusions drawn in the paper, especially the prompt used and the size of the datasets. Therefore, I would recommend improving the paper by including more ablations or clarification on the impact of the phrasing of the prompt (not only the examples included in it) and by strengthening the assessment of the significance of the conclusions drawn from the quantitative results.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
While I still have some concerns regarding the prompt and sample sizes, I find that the authors addressed most of my comments and will implement them in the final version of the paper. Therefore I find that the paper is suitable and offers interesting insights for the MICCAI community.
Review #2
- Please describe the contribution of the paper
The authors study the effect of in-context learning (ICL) on the overall performance and demographic biases of general-purpose vision-language models in medical image analysis. They demonstrate that i) model predictions are generally strongly influenced by the base rate in demonstrations, ii) these effects are specific to demographic groups, i.e., groups with a higher base rate in the demonstration set will receive higher model predictions, iii) providing more demonstration examples - even at equal base rates among demographic groups - can (unexpectedly?) strongly improve performance in one demographic group directly at the expense of another.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The study is very well written, well-constructed, and easy to follow. The results are clearly presented and convincing. ICL using VLMs is an emerging paradigm in MIA, and carefully assessing the merits and limitations of this approach is important and timely. This study provides a clear and concise contribution to this emerging research field.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The study does not in any way assess or speculate on the causes of the observed results, and no mitigation techniques other than carefully checking for demographic biases are proposed. No insights beyond the experimental results themselves (which are interesting!) are provided.
ICL has received much attention in the broader ML literature, and I believe not all relevant prior work is appropriately discussed. Various improvements to the basic ICL scheme, in particular regarding output calibration, have been proposed. For example, [1-4] seem relevant and should be discussed. Possibly, some of these approaches should be implemented as well, where feasible.
No non-ICL baseline method is included in the experiments. To provide some context for the results obtained here, it would seem important to compare the ICL results to some more standard few-shot learning approach, such as fitting a small classification head to a medical foundation model, e.g. MedImageInsight [5], UMedPT [6] or similar.
[1] Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering, https://openreview.net/forum?id=L3FHMoKZcS
[2] What Makes Good Examples for Visual In-Context Learning?, https://proceedings.neurips.cc/paper_files/paper/2023/hash/398ae57ed4fda79d0781c65c926d667b-Abstract-Conference.html
[3] A Study on the Calibration of In-context Learning, https://aclanthology.org/2024.naacl-long.340.pdf
[4] Embedded prompt tuning: Towards enhanced calibration of pretrained models for medical images, https://doi.org/10.1016/j.media.2024.103258
[5] MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging, https://arxiv.org/pdf/2410.06542
[6] Overcoming data scarcity in biomedical imaging with a foundational multi-task model, https://www.nature.com/articles/s43588-024-00662-z
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Why are none of the many medical VLMs included in the experiments? The authors mention several criteria based on which they selected models; could they elaborate on which of these criteria led to the exclusion of such models? This seems important as the use of a medical VLM would seem like a much more reasonable approach compared to just using a general-purpose VLM.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The experiments are interesting, insightful, and very clearly presented. The manuscript could be further strengthened by investigating some previously proposed potential mitigation approaches or at least discussing such methods appropriately, as well as by including some non-ICL baseline method for comparison.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
As stated in my initial review, I consider this a concise and very clearly presented contribution on ICL as an important, recently emerging paradigm. The authors’ response to the reviews is comprehensive and, in my opinion, quite effectively rebuts the initial concerns and adds further important context.
Review #3
- Please describe the contribution of the paper
The paper examines how the demographic composition of demonstration examples for in-context learning affects VLMs performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is very well written, interesting, and easy to follow
- The discussion of the results is excellent and provided useful insights
- Code and prompts will be released
- Overall very high quality
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The main weakness of the paper is that the experiments are somewhat limited: as the authors note, CheXpert is not really optimal, likely because the model cannot distinguish the demographic features from the sample itself. Thus, the authors essentially validate their findings on only one dataset. It would be great if the authors could extend their investigation to other datasets or demographic features that are indeed detectable by the VLMs.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
For the experiments the authors use the DDI dataset, which contains data quality issues that should be removed in this setting.
Gröger, F., Lionetti, S., Gottfrois, P., Groh, M., Daneshjou, R., Consortium, L., Navarini, A. A., & Pouly, M. (2023). Towards Reliable Dermatology Evaluation Benchmarks.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Very interesting and high-quality paper with useful insights for the MICCAI community.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The paper is interesting to the community, but very limited in scope. I had personally hoped that the authors would address this more in the rebuttal, as almost all reviewers mentioned it. Instead, they only acknowledged that the scope is indeed small and could be extended, without justifying their dataset selection or defending against the comment about the unsuitability of CheXpert, given their own finding that the models cannot detect any demographics in those samples, which makes a further bias analysis on such a dataset nonsensical.
Author Feedback
We thank all the reviewers for their feedback. While we are unable to include new experimental results in the rebuttal per MICCAI guidelines, we hope the following clarifications strengthen the reviewers’ confidence in the work.
- Dataset Concerns: We acknowledge concerns regarding dataset size, and can add a few lines to discuss these limitations. We thank the reviewers for bringing our attention to datasets we weren’t previously aware of (e.g. retinal fundus datasets), which could have provided a third clinical modality. For the following reasons, however, we believe our datasets remain adequate for demonstrating our intended effects:
A) While the DDI test set is small (104 samples), this is not unusual in medical imaging research. The official CheXpert test set includes only 8 positive Lung Lesion examples, 11 for Pneumonia, and 9 for Pneumothorax—yet remains widely accepted (3195 citations).
B) While Figure 5 (which looks at predictive performance within demographic subgroups) theoretically could be more impacted by small sample sizes, Figures 2–4 measure (conditional) average predicted outputs, not classification accuracy. These metrics should be relatively independent of the content of the images in the test set, since we emphasize the models’ relative propensity to predict a particular label as the frequency of that label in the prompt increases, rather than absolute predictive performance on those samples.
C) For CheXpert sampling: we filtered for frontal images, split by patients (rather than studies) to avoid test set leakage, and then randomly sampled balanced subsets (100 of each pos/neg for pneumothorax per sex in the demo set; 25 pos/neg per sex in the test set). We believe this random sampling reduced the risk of unintended sampling bias.
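The sampling procedure described in C) could be sketched as follows. This is a minimal illustration, not the authors' implementation: the record keys (patient_id, sex, label, view) and the even patient split are assumptions made for the sake of a runnable example.

```python
import random
from collections import defaultdict

def build_splits(records, n_demo=100, n_test=25, seed=0):
    """Sketch of patient-level balanced sampling: keep frontal views, assign
    each patient to exactly one pool (avoiding demo/test leakage), then draw
    balanced (sex x label) subsets for the demonstration and test sets."""
    rng = random.Random(seed)
    frontal = [r for r in records if r["view"] == "frontal"]

    # Split by patient, not by study/image, so no patient spans both pools.
    patients = sorted({r["patient_id"] for r in frontal})
    rng.shuffle(patients)
    demo_patients = set(patients[: len(patients) // 2])

    pools = {"demo": defaultdict(list), "test": defaultdict(list)}
    for r in frontal:
        split = "demo" if r["patient_id"] in demo_patients else "test"
        pools[split][(r["sex"], r["label"])].append(r)

    # e.g. 100 pos/neg per sex for demonstrations; 25 pos/neg per sex for test.
    demo = [s for cell in pools["demo"].values() for s in rng.sample(cell, n_demo)]
    test = [s for cell in pools["test"].values() for s in rng.sample(cell, n_test)]
    return demo, test
```

Splitting at the patient level before sampling is what prevents images of the same patient from appearing in both the demonstration and test sets.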
Prompt sensitivity: While we performed several internal tests with different prompt phrasings and structures that all showed the same general trends as our final versions, we did not perform systematic ablations on prompt structure, and we agree this would be a valuable direction for future work.
Regarding GPT-4o’s higher sex classification accuracy but reduced propensity to learn sex-pneumothorax associations: We hypothesize this relates to alignment or safety training that discourages demographically biased medical classifications. Claude can clearly infer sex from CXRs (evidenced by Fig 3g’s dependency pattern) but may be discouraged from using such inferences for medical predictions. We’ve noted this in discussion, though full investigation exceeds this paper’s scope.
Output calibration: We’ve added citations to the output calibration methods suggested by Reviewer #1 as applicable for mitigating identified biases, and will note the importance of investigating whether biases emerge with other adaptation methods like fine-tuning (with citations to the provided methods).
Model Selection: Most medical VLMs were excluded because they don’t support multiple interleaved images required for visual ICL. We attempted to use MedFlamingo (9B), but it struggled with batch inference formatting compared to larger commercial models. We’ve added to the Discussion that comparing this model to similar-sized open-source models (OpenFlamingo, Otter, Llava-OV) would be valuable future work.
Notation Clarification: We’ve clarified that “f(x)” represents the binary classification output, “g(x)” is an indicator function evaluating as 1 when sample x belongs to a particular demographic group (0 otherwise), and “y” is the disease label for image “x”.
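As a toy illustration of this notation (the values below are made up), the subgroup-conditional average prediction E[f(X) | g(X) = 1] can be computed like so:

```python
# Each tuple is (f(x), g(x), y): the model's binary prediction, the
# demographic-group indicator (e.g., 1 for female patients in CheXpert,
# 0 otherwise), and the true disease label for image x.
samples = [
    (1, 1, 1),
    (0, 1, 0),
    (1, 0, 0),
    (0, 0, 0),
]

# Subgroup-conditional average prediction, E[f(X) | g(X) = 1]:
in_group_preds = [f for f, g, _ in samples if g == 1]
print(sum(in_group_preds) / len(in_group_preds))  # 0.5 for this toy set
```

Comparing this quantity across g(x) = 1 and g(x) = 0 is the kind of subgroup contrast the experiments measure.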
# of replicates: When we analyzed our data with t-tests comparing prediction rates between experiments with more vs. fewer positive labels in prompts, most findings (e.g., Fig 3g) were statistically significant with three replicates, though some (e.g., Fig 3f) didn't reach p<.05 despite visual trends. While additional replicates could clarify these cases, we believe our main results hold with the current number of replicates.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
A timely and well-executed empirical study offering novel insights into demographic biases in in-context learning with vision-language models, meriting acceptance despite some limitations in scope and dataset size.