Abstract
Recent advances in medical vision-language models (VLMs) demonstrate impressive performance in image classification tasks, driven by their strong zero-shot generalization capabilities. However, given the high variability and complexity inherent in medical imaging data, the ability of these models to detect out-of-distribution (OOD) data in this domain remains underexplored. In this work, we conduct the first systematic investigation into the OOD detection potential of medical VLMs. We evaluate state-of-the-art VLM-based OOD detection methods across a diverse set of medical VLMs, including both general-purpose and domain-specific models. To accurately reflect real-world challenges, we introduce a cross-modality evaluation pipeline for benchmarking full-spectrum OOD detection, rigorously assessing model robustness against both semantic shifts and covariate shifts. Furthermore, we propose a novel hierarchical prompt-based method that significantly enhances OOD detection performance. Extensive experiments validate the effectiveness of our approach. The code is available at https://github.com/PyJulie/Medical-VLMs-OOD-Detection.
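For orientation, most VLM-based OOD detectors score an image by how strongly it matches any in-distribution class prompt in the joint embedding space. The sketch below illustrates this recipe in the style of MCM, one of the baselines evaluated in the paper; the generic CLIP checkpoint and fundus prompts are illustrative assumptions, not the paper's actual configuration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint and in-distribution prompts; the paper evaluates
# medical VLMs (e.g., FLAIR, UniMedCLIP, QuiltNet) rather than generic CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
id_prompts = ["a fundus photograph of a healthy retina",
              "a fundus photograph showing diabetic retinopathy"]

@torch.no_grad()
def mcm_score(image, tau: float = 100.0) -> float:
    """MCM-style score: max softmax over scaled image-text cosine similarities.
    An image that matches no in-distribution prompt gets a low score (likely OOD)."""
    inputs = processor(text=id_prompts, images=image,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (tau * img @ txt.t()).softmax(dim=-1).max().item()
```

Thresholding this score (for example, at the value achieving 95% TPR on held-out ID data) turns it into an OOD decision rule.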
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0815_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/PyJulie/Medical-VLMs-OOD-Detection
Link to the Dataset(s)
FIVES dataset: https://figshare.com/articles/figure/FIVES_A_Fundus_Image_Dataset_for_AI-based_Vessel_Segmentation/19688169/1
LC25000 dataset: https://github.com/tampapath/lung_colon_image_set
COVID-19 dataset: https://www.kaggle.com/datasets/amanullahasraf/covid19-pneumonia-normal-chest-xray-pa-dataset?select=normal
DeepDRiD dataset: https://www.kaggle.com/datasets/chopinforest1986/ultradeepdrid
BibTeX
@InProceedings{JuLie_Delving_MICCAI2025,
author = { Ju, Lie and Zhou, Sijin and Zhou, Yukun and Lu, Huimin and Zhu, Zhuoting and Keane, Pearse A. and Ge, Zongyuan},
title = { { Delving into Out-of-Distribution Detection with Medical Vision-Language Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {132 -- 142}
}
Reviews
Review #1
- Please describe the contribution of the paper
Medical vision-language models (VLMs) have shown promising ability to generalize in few-shot and zero-shot settings. The manuscript investigates out-of-distribution (OOD) detection for medical VLMs, which is crucial for ensuring the reliability of these models in real-world applications. It proposes a benchmark based on an aggregate of 4 datasets, where OOD samples are generated according to three different types of domain shift: semantic (anatomical/diagnostic differences), covariate (imaging protocol/scanner), and ImageNet (as a far-OOD baseline). An additional hierarchical prompting strategy is introduced to improve OOD detection by refining the text captions with increasing specificity about the clinical context (diagnosis, lesion morphology, modality, anatomical context, etc.). The authors evaluate 7 VLMs in an exhaustive comparison of OOD detection and then select the three top performers (FLAIR, UniMedCLIP, and QuiltNet) for further evaluation using advanced methods in both the zero-shot and few-shot settings. No single method appears to dominate, although MCM stands out.
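For reference, OOD detection comparisons of this kind are conventionally summarized with AUROC and FPR@95%TPR computed from per-sample scores. A minimal sketch follows, assuming the convention that higher scores indicate in-distribution samples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ood_metrics(id_scores: np.ndarray, ood_scores: np.ndarray) -> tuple[float, float]:
    """AUROC and FPR@95%TPR, treating ID as the positive class.
    Assumes higher scores indicate in-distribution samples."""
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = float(fpr[np.searchsorted(tpr, 0.95)])  # FPR at the first TPR >= 0.95
    return float(auroc), fpr95
```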
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The manuscript is well motivated by understanding the limitations of VLMs and the need for OOD detection in medical applications.
- The proposed benchmark is comprehensive, covering a variety of datasets and OOD detection methods.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited novelty. The paper proposes a benchmark aggregated from existing datasets for measuring OOD detection performance of medical vision-language models.
- I could not find where hierarchical prompting is present in the evaluation. Was it used throughout?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The manuscript investigates a reasonably important problem, OOD detection for medical VLMs, and proposes a benchmark based on an aggregate of 4 datasets. However, the benchmark is not particularly novel, as it is based on existing datasets and simple transformations to generate OOD samples. The main novelty, hierarchical prompting, is not evaluated in a comprehensive manner. Instead, the evaluation focuses on reporting metrics for seemingly as many strategies and models as possible, without a clear focus on what these values mean or how they demonstrate the contribution of the paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
While I agree that OOD detection is important and the benchmark is worthwhile, I would respectfully maintain that the paper has limited technical/methodological novelty. Points (1) and (2) in the response concern the novelty of the benchmark, not the novelty of the method. New benchmarks are important, and they can make very useful contributions as a MICCAI challenge, but they are not a methodological novelty. Point (3) shows potential, yes, but is not substantial enough to make up for the lack of novelty in the benchmark.
Review #2
- Please describe the contribution of the paper
This paper presents the first systematic study of out-of-distribution (OOD) detection for medical vision-language models (VLMs). It constructs a comprehensive full-spectrum benchmark including both semantic and covariate shifts, and evaluates a range of general-purpose and domain-specific medical VLMs. A hierarchical prompt-based method is proposed to enhance OOD separability, leveraging multi-level clinical semantics (e.g., disease, modality, anatomy). Experiments demonstrate consistent performance improvements across multiple models and datasets under both zero-shot and few-shot settings.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The topic of OOD detection in medical VLMs is timely and relevant.
- The benchmark covers realistic OOD types (e.g., modality and quality shifts) that are clinically meaningful.
- Experimental results are well-organized and include both zero-shot and few-shot baselines.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Insufficient methodological novelty: The proposed method mainly focuses on constructing hierarchical prompts, which is a form of prompt engineering rather than a substantial algorithmic or architectural innovation. No new scoring functions, model components, or learning paradigms are introduced.
- Reliance on LLM-generated prompts: The hierarchical prompts are generated using GPT-4o. While this leads to richer semantic inputs, the feasibility of replicating such prompts in real-world clinical settings is questionable without external LLM support.
- Benchmark construction lacks diversity and validation: Although the benchmark includes multiple datasets, the number of domains and classes remains limited. There is no external expert validation of the difficulty or realism of the shifts, and the far-OOD settings rely heavily on natural-image datasets (e.g., ImageNet), which may not reflect true deployment scenarios.
- Performance improvements are incremental and not universal: While some gains are reported, especially in covariate-shift settings, the overall improvements are not consistent across models and detection types. Some setups even degrade performance on far-OOD tasks (e.g., ImageNet).
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper addresses an important, underexplored topic with high practical relevance. The benchmark and prompting strategy are valuable contributions to the field, and the analysis is thorough. While the method is mainly built upon existing components and lacks deeper architectural or theoretical novelty, the work is likely to spark follow-up research in reliable medical VLM deployment. I recommend acceptance conditional on clarifying prompt generation feasibility and releasing reproducibility resources.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper “Delving into Out-of-Distribution Detection with Medical Vision-Language Models” presents a systematic exploration of out-of-distribution (OOD) detection in medical vision-language models (VLMs), addressing both semantic and covariate shifts. The authors provide extensive comparisons and ablations across multiple datasets, models, and fine-tuning configurations, evaluating both in-distribution classification and the detection of various types of OOD data. Additionally, they propose a novel prompting strategy to enhance OOD detection performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper provides a clear and well-structured statement of contributions. The proposed hierarchical prompt diversification is novel and interesting. The study includes extensive benchmarking across various models and datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Certain sections would benefit from additional clarification; please refer to the comments.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Table 2 presents the few-shot learning performance for OOD detection using the proposed hierarchical prompting strategy (Figure 2). Have the authors investigated the performance using simple prompts as a baseline? It would be valuable to disentangle the individual effects of few-shot training and the proposed prompting approach.
- Please clarify the stratification strategy used for the mentioned datasets. Did the authors ensure that data splitting was performed on a patient-level basis to prevent data leakage? Additionally, was a portion of the data reserved for validation and early stopping?
- The heading of Section 4.3, titled “Analysis,” is overly broad. Consider revising it to something more specific, such as “Ablation Studies,” to better reflect the content of the section.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please refer to the comments.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have adequately addressed my concerns in the rebuttal. Provided that they incorporate the proposed changes and include the clarifications outlined in their response, I believe the paper would be suitable for acceptance.
Author Feedback
Reviewer 1:
Q1: Performance using simple prompts. A: Thank you for your point. We first evaluated various few-shot methods using simple prompts. Among these, we selected LoCoOp, which demonstrated strong average performance across the three OOD datasets. We then applied our proposed hierarchical prompts to LoCoOp (denoted with (L=5)) to assess the effectiveness of richer semantics.
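For readers unfamiliar with LoCoOp: it builds on CoOp-style prompt tuning, where a small set of context vectors prepended to the class-name token embeddings is learned on few-shot ID data while the VLM stays frozen, and adds a regularizer that pushes ID-irrelevant local features away from the ID classes. A bare CoOp-style sketch of the learnable-prompt component is given below; `text_encoder` (a frozen text tower that accepts token embeddings) and the pre-extracted image features are assumptions, and the LoCoOp regularizer is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """CoOp-style learnable context vectors shared across classes.
    LoCoOp additionally regularizes against ID-irrelevant local (patch)
    features; that term is omitted in this sketch."""
    def __init__(self, n_ctx: int, dim: int, class_token_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))  # learnable "words"
        self.register_buffer("cls", class_token_embeds)  # frozen (C, L, dim) name tokens

    def forward(self) -> torch.Tensor:
        ctx = self.ctx.unsqueeze(0).expand(self.cls.size(0), -1, -1)
        return torch.cat([ctx, self.cls], dim=1)  # per-class prompt token embeddings

def few_shot_step(prompt_learner, text_encoder, img_feats, labels, opt, scale=100.0):
    """One update on a few-shot batch: only the context vectors get gradients."""
    txt = F.normalize(text_encoder(prompt_learner()), dim=-1)  # (C, d)
    img = F.normalize(img_feats, dim=-1)                       # (N, d)
    loss = F.cross_entropy(scale * img @ txt.t(), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```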
Q2: The stratification, validation, and early-stopping strategy. A: Thank you for your insightful question. We adhered to the official train/test partitions of the public datasets (FIVES, ISIC 2019, etc.), explicitly avoiding leakage. For consistency and fairness, we followed HGCLIP and trained all evaluated methods for 50 epochs without early stopping.
Reviewer 3:
Q3: Benchmark lacks diversity and validation. A: Thank you for this thoughtful comment. Our benchmark is designed to reflect realistic clinical challenges by incorporating both semantic and covariate shifts across three distinct medical domains, chosen specifically because public medical VLMs are available for each. To ensure clinical relevance, we consulted medical experts when selecting OOD scenarios such as quality and modality shifts. Natural images are included as far-OOD primarily for completeness and consistency with prior work. Importantly, our methods do not rely on any retraining with OOD data in either the zero-shot or the few-shot setting.
Q4: Performance improvements are not universal. A: We appreciate this insightful observation. The inconsistent gains on far-OOD data stem mainly from the pre-training of medical VLMs, which are often fine-tuned on domain-specific data and are thus less effective at distinguishing natural images. However, our few-shot training strategy consistently improves performance on more clinically relevant and challenging cases, particularly covariate-shifted samples, which are known to be harder to detect. Overall, our method enhances robustness where it is most critical: in identifying subtle and clinically confusing OOD instances.
Reviewer 3 & 4:
Q1: Limited technical novelty. A: Thank you for pointing this out. We would like to clarify our main novelty as follows: (1) We are the first to benchmark full-spectrum OOD detection (semantic + covariate shift) in the medical VLM setting, covering 3 medical domains, 7 medical VLMs, and 8 reproduced advanced methods, and we propose a hierarchical prompt framework that supports both zero-shot and few-shot training to meet the demands of diverse scenarios. (2) Our benchmark is carefully designed in collaboration with clinical experts to simulate realistic deployment risks. We include semantic shifts, modality shifts, quality degradation, and cross-domain generalization, which are rarely captured together in prior OOD studies. (3) Improving CLIP’s OOD detection shows potential for reducing hallucinations in multimodal large language models (MLLMs), as CLIP often serves as the visual encoder for models such as LLaVA. Enhanced OOD awareness enables MLLMs to express uncertainty rather than produce incorrect predictions, improving reliability in real-world medical deployments.
Q2: The use of LLM-generated prompts. A: Thank you for the valuable point. To clarify, GPT-4o was used only once to generate hierarchical prompts for each category, covering basic medical concepts such as organs, diseases, and lesions, and the prompts were verified by relevant clinicians. These prompts could also be manually written by medical professionals or extracted from clinical textbooks as alternative sources. The results indicate that enriching the semantic content of prompts improves OOD detection in both zero-shot and few-shot settings. All prompts used in this study will be publicly released with the code to ensure reproducibility.
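To make this concrete, a hierarchical prompt set of the kind described might look as follows for a single fundus category; these strings are hypothetical illustrations, not the prompts to be released. The text embeddings of the levels can then be averaged or ensembled before scoring.

```python
# Hypothetical L=5 hierarchy of increasing clinical specificity
# (modality -> anatomy -> disease -> lesion morphology). Illustrative only.
hierarchical_prompts = {
    "diabetic retinopathy": [
        "a medical image",
        "a color fundus photograph",
        "a color fundus photograph of the retina",
        "a color fundus photograph of the retina showing diabetic retinopathy",
        "a color fundus photograph of the retina showing diabetic retinopathy "
        "with microaneurysms, hemorrhages, and hard exudates",
    ],
}
```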
To address the issues mentioned above, we will carefully revise our manuscript. We sincerely hope that our clarifications and planned revisions will receive your kind consideration and support.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
It appears that the only reviewer positioned against acceptance relies heavily on a lack of novelty, the other concerns being mitigated by authors’ response. Since this is an OoD benchmark for VLMs, I don’t think we should focus too much on novelty, but rather on rigorous validation, so I’d rather recommend acceptance.
Also, I do not appreciate the authors using the “confidential comments” field to extend their rebuttal by summarizing it to me and the other meta-reviewers.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Two reviewers recommend acceptance, citing the practical relevance, thorough experimental design, and clarity of presentation. A third reviewer expresses reservations about methodological novelty, noting that the benchmark aggregates existing datasets and that the prompting strategy is a form of prompt engineering rather than a core architectural innovation.
While this concern is valid, it is important to note that novelty in MICCAI contributions does not need to arise solely from architectural changes. Well-motivated re-framings of existing tools, such as hierarchical prompting in a clinically meaningful setting, can constitute a meaningful contribution when they advance community understanding or practice.
That said, the paper somewhat overstates its claims. OOD detection in medical imaging is not a new problem, and several prior works have addressed it. Similarly, the benchmark is not specific to VLMs and does not constitute the first “full-spectrum” OOD benchmark [1]. For future work, I recommend that the authors include established OOD baselines (e.g., [2] – just to name one example; no connection to self) in their comparisons, particularly when the task is not exclusive to VLMs.
[1] Hong, Zesheng, et al. “Out-of-distribution detection in medical image analysis: A survey.” arXiv preprint arXiv:2404.18279 (2024).
[2] Graham, Mark S., et al. “Unsupervised 3D out-of-distribution detection with latent diffusion models.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.