Abstract
Training data in the medical domain is often limited due to privacy concerns and data scarcity. In such few-shot settings, neural network models are prone to overfitting, resulting in poor performance on new in-distribution (ID) data and misclassification of out-of-distribution (OOD) data as learned ID diseases. Existing research treats these two tasks (few-shot learning and few-shot OOD detection) separately, and no prior work has explored a unified approach to improving the performance of both simultaneously. To bridge this gap, we propose a novel framework based on CLIP that jointly enhances ID classification accuracy and OOD detection performance. Our framework consists of three key components: (1) a visually-guided text refinement module, which refines the text representation of each disease using disease-relevant visual information; (2) a local version of supervised contrastive learning, which enhances local representation consistency among disease-relevant regions while improving ID-OOD separability; and (3) a global and local image-text alignment strategy, which adaptively combines global and local similarity measurements for better image-text alignment. Extensive experiments demonstrate that our method outperforms the best methods tailored specifically to each task, achieving new state-of-the-art performance. The source code will be publicly released.
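The global-local alignment idea in component (3) can be illustrated with a minimal sketch. This is not the authors' implementation: the convex fusion coefficient `alpha` and the max-over-patches local score are illustrative assumptions standing in for whatever adaptive combination the paper actually uses.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_similarity(global_img, patch_imgs, text, alpha=0.5):
    """Combine global and local image-text similarity for one disease class.

    global_img: (d,) global image embedding
    patch_imgs: (n, d) local patch embeddings
    text:       (d,) text embedding of the disease
    alpha:      fusion coefficient (hypothetical fixed value; the paper's
                strategy combines the two scores adaptively)
    """
    s_global = cosine(global_img, text)
    # Local score: best-matching patch, reflecting the idea that diagnostic
    # evidence is often confined to a small region of the image.
    s_local = max(cosine(p, text) for p in patch_imgs)
    return alpha * s_global + (1 - alpha) * s_local
```

At inference, such a fused score would be computed against each disease's text embedding, with the highest-scoring class taken as the ID prediction and a low maximum score signaling a potential OOD sample.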
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0981_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://openi.pcl.ac.cn/OpenMedIA/GLAli
Link to the Dataset(s)
N/A
BibTex
@InProceedings{YanJie_Global_MICCAI2025,
author = { Yan, Jie and Guan, Xiaoyuan and Zheng, Wei-Shi and Chen, Hao and Wang, Ruixuan},
title = { { Global and Local Vision-Language Alignment for Few-Shot Learning and Few-Shot OOD Detection } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {207--217}
}
Reviews
Review #1
- Please describe the contribution of the paper
- To enhance the performance of Vision-Language Models (VLMs) in medical image classification under few-shot learning settings, this manuscript proposes improvements in three of the four fundamental components of a typical VLM architecture: the text encoder, the fusion module, and the pre-training objectives.
- Specifically, the text encoder is enhanced with a visually-guided text refinement module; the fusion module incorporates a global-local image-text alignment strategy; and the pre-training objective is improved by introducing a localized version of supervised contrastive learning.
- In addition to improving few-shot learning performance on novel in-distribution samples, this manuscript also strengthens the model’s capacity to detect out-of-distribution data, effectively mitigating the misclassification of unseen inputs as previously learned categories.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This manuscript proposes a novel CLIP-based training and inference framework designed for both few-shot medical image classification and out-of-distribution (OOD) detection under few-shot settings.
- Text encoder (refined textual representation of each disease): the text refinement module refines text embeddings using estimated disease-relevant visual information.
- Pre-training objective (supervised local contrastive learning): the supervised local contrastive learning uses estimated disease-relevant and background local regions to enhance representation consistency within each class and to help improve OOD detection.
- Fusion module (global and local image-text alignment): the proposed global and local image-text alignment achieves better alignment of local visual-textual sets, further improving ID classification and OOD detection.
- Model training and inference: the proposed method achieves new SOTA performance on both tasks across three medical benchmarks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The manuscript does not address computational complexity. For testing on medical images, why was CLIP chosen instead of MedVLM?
- The manuscript should clearly specify the chosen approach for few-shot learning and how it compares to methods such as adapter-based learning and prompt learning.
- Please provide a more detailed explanation of how few-shot learning and few-shot out-of-distribution (OOD) detection are performed.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The manuscript does not address computational complexity. For testing on medical images, why was CLIP chosen instead of MedVLM?
- The manuscript should clearly specify the chosen approach for few-shot learning and how it compares to methods such as adapter-based learning and prompt learning.
- Please provide a more detailed explanation of how few-shot learning and few-shot out-of-distribution (OOD) detection are performed.
- The manuscript should present the main algorithms to facilitate easy reproduction of results. It should also provide clear guidelines on how to select parameters such as the hyperparameter k, the weights λ₁, λ₂, λ₃, the fusion coefficient α, and the temperature scaling factors.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
In this paper, the authors propose a novel framework based on CLIP to enhance both few-shot classification and few-shot OOD detection. In the method, a visually-guided text refinement module is proposed to optimize text embeddings, supervised contrastive learning is employed to improve ID-OOD separability, and a global and local strategy is proposed to enhance image-text alignment.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Compared with existing methods, this method has achieved good results in the experiments.
- A detailed ablation study is conducted in the experiments.
- The paper is well written and the method is clearly elaborated.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In Fig. 1(a), the bounding boxes do not seem to correspond to image patches; please explain this.
- In the experiments, why was ISIC8 significantly less accurate than the other two datasets?
- How is the experimental data preprocessed? When k increases, is it possible for disease-irrelevant and disease-relevant patches to overlap? If so, how would that affect the results?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please refer to the strengths and weaknesses above.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper presents a few-shot learning method designed to enhance both in-distribution (ID) classification and out-of-distribution (OOD) detection. A key contribution is the proposal of a unified framework that addresses both tasks simultaneously. The approach integrates both global and local image-text alignment, motivated by the observation that critical diagnostic information in medical images is often localized to small regions. The method also introduces a visually-guided text refinement mechanism, which updates the textual representations based on disease-relevant visual regions, and a local version of supervised contrastive learning that uses both lesion-focused and background features to improve class discrimination and OOD detection.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well-written and clearly presented, meeting the standard bar of a MICCAI paper. The experiments are extensive, with the necessary ablation studies. It introduces a novel framework that is useful to the MICCAI community in further advancing OOD detection in the VLM context. Since research is rapidly moving in this direction, this contribution is timely. The paper makes a useful observation that critical diagnostic information in medical images is often localized to small regions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper is generally well-organized and clearly written. However, some issues in the figures reduce the overall clarity. For example, in Figure 2, the results for the NPOS method (mentioned in the legend) are missing from the plot. Similarly, in Figure 1(b), index 0 is repeated in the image and text representations.
The work does not provide an actual visualization of the embeddings similar to the one shown in Figure 1(a). The argument against visual representations of the whole image should be properly supported by a t-SNE/UMAP visualization. Furthermore, providing proper ablation studies regarding contrastive learning is highly recommended.
It may be useful if the authors could explain in the final camera-ready version why they opted for the supervised contrastive loss instead of the losses [1, 2, 3] proposed specifically for OOD detection.
[1] How to Exploit Hyperspherical Embeddings for Out-of-Distribution Detection?
[2] Loss Reweighting for Distance-based OOD Detection
[3] Provable Discriminative Hyperspherical Embedding for Out-of-Distribution Detection
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is empirically strong and includes the relevant baselines for the comparisons. The method is indeed novel. The strengths mentioned above outweigh the weaknesses.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
N/A