Abstract

Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable efforts have been dedicated to establishing medical foundation models and their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Motivated by the strong emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields competitive performance in comparison to convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We make our benchmark and code publicly available to encourage further developments in this emergent subject.
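
For readers unfamiliar with this family of adapters, the sketch below illustrates one way such a text-informed linear probe could look in PyTorch. The initialization, normalization, and exact blending form are assumptions for illustration only and are not taken from the paper or its code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextInformedLinearProbe(nn.Module):
    """Illustrative sketch: a linear probe whose class weights blend learnable
    visual prototypes with frozen text embeddings via per-class multipliers."""

    def __init__(self, vision_prototypes: torch.Tensor, text_embeddings: torch.Tensor):
        super().__init__()
        # (K, D) visual prototypes, e.g. class means of the few-shot image embeddings
        self.prototypes = nn.Parameter(vision_prototypes.clone())
        # (K, D) frozen class text embeddings produced by the text encoder
        self.register_buffer("text_emb", F.normalize(text_embeddings, dim=-1))
        # one learnable blending multiplier per class (zero-initialized here)
        self.alpha = nn.Parameter(torch.zeros(text_embeddings.shape[0]))

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # blend visual prototypes and text embeddings class-wise, then score
        weights = self.prototypes + self.alpha.unsqueeze(-1) * self.text_emb  # (K, D)
        return F.normalize(image_emb, dim=-1) @ F.normalize(weights, dim=-1).t()  # (N, K)
```

Because training only touches precomputed embeddings and the small set of probe parameters, such a probe can in principle also be fitted in a black-box setting where the encoder weights are inaccessible.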

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2320_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2320_supp.pdf

Link to the Code Repository

https://github.com/FereshteShakeri/few-shot-MedVLMs

Link to the Dataset(s)

https://physionet.org/content/mimic-cxr-jpg/2.1.0/
https://stanfordmlgroup.github.io/competitions/chexpert/
https://paperswithcode.com/dataset/nct-crc-he-100k
https://data.mendeley.com/datasets/9xxm58dvs3/1
https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018
https://www.adcis.net/en/third-party/messidor/
https://www.kaggle.com/datasets/nitishsingla0/fives-dataset
https://www.kaggle.com/datasets/tanjemahamed/odir5k-classification



BibTex

@InProceedings{Sha_Fewshot_MICCAI2024,
        author = { Shakeri, Fereshteh and Huang, Yunshi and Silva-Rodriguez, Julio and Bahig, Houda and Tang, An and Dolz, Jose and Ben Ayed, Ismail},
        title = { { Few-shot Adaptation of Medical Vision-Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a benchmark for adapting medical vision-language models (VLMs) under a few-shot learning setup. It also proposes a method that blends vision and text embeddings to improve linear-probing performance. Essentially, the proposed linear probing baseline (LP+text) is a linear mixing of the vision and text embeddings via learnable parameters. The benchmark evaluates the performance of three prompt-learning methods, two black-box adapter methods, and a linear probing baseline against the proposed method across nine datasets spanning three modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper benchmarks 6 few-shot adaptation methods on 9 datasets covering 3 different modalities.
    2. The experiment results show the proposed LP+text stands out in most cases.
    3. The paper also demonstrates that the proposed LP+text is much more computationally efficient than the other methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The LP+text baseline proposed in this study combines features from the vision embedding and the text embedding using a learnable multiplier. However, the rationale behind using a learnable multiplier for blending the embeddings is not immediately intuitive. Instead of employing a learnable multiplier, a more straightforward approach could involve concatenating the vision embedding and text embedding and then training a linear classifier or a multilayer perceptron (MLP) for few-shot adaptation (one possible realization of this alternative is sketched after this list). To address this uncertainty, an ablation study would be beneficial in clarifying the effectiveness and necessity of the learnable multiplier in improving model performance.

    2. In the method section, several aspects are unclear and require further clarification: (a) The definition of class prototypes is ambiguous. Equation (1) suggests that they represent the linear layer for a specific class. However, the paper defines them as the last-layer weights of the vision encoder, which seems inaccurate. Isn’t the vision encoder θv frozen? Is w actually the last layer of θv? (b) The meaning of t in equations (1) and (2) remains unclear.

    3. The related work section merely lists existing works without providing a contrast with the proposed work. To enhance the section’s effectiveness, it would be beneficial to compare and contrast the proposed approach with the existing literature. This comparison could highlight the unique contributions, advantages, and potential limitations of the proposed method in relation to prior research, providing readers with a clearer understanding of its novelty and significance in the field.

    4. The experiments section mentions, “Since these datasets are also further used for evaluation, we pre-trained this model to better control test partition and ensure fairness.” Did the authors also pretrain MedCLIP on CheXpert and MIMIC-CXR? What is the rationale behind pretraining the model to improve test partition control and ensure fairness?

    5. For the datasets CheXpert_5×200, MIMIC_5×200, what is the meaning of “5×200”?

    6. The writing and organization of the method and experiments sections should be improved to make the paper more accessible, organized, and informative for readers.
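
As a purely illustrative realization of the concatenation-based alternative suggested in weakness 1, the PyTorch sketch below scores each (image, class) pair with a small MLP applied to the concatenated image and class text embeddings. The architecture, names, and hyper-parameters are hypothetical and are not drawn from the paper.

```python
import torch
import torch.nn as nn


class ConcatScorer(nn.Module):
    """Illustrative ablation: score each (image, class) pair by feeding the
    concatenated image embedding and class text embedding through a small MLP."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (N, D) image embeddings, text_emb: (K, D) class text embeddings
        n, k = image_emb.shape[0], text_emb.shape[0]
        img = image_emb.unsqueeze(1).expand(n, k, -1)   # (N, K, D)
        txt = text_emb.unsqueeze(0).expand(n, k, -1)    # (N, K, D)
        return self.mlp(torch.cat([img, txt], dim=-1)).squeeze(-1)  # (N, K) logits
```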

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to 6. weaknesses. There is also a typo in Abstract: “liner-probe”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The weaknesses section of the paper lists all my questions and doubts. To meet the acceptance criteria, these doubts need clarification and resolution.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents an easy-to-implement, improved alternative to linear probing that integrates text knowledge. The approach is evaluated against prompt-learning and CLIP-based adapters on both performance and training-time metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The key novelty of the paper is an improved alternative to linear probing coupled with the more detailed analysis of the results against alternative methods. The paper is well written and does a great job at summarizing previous approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Overall the paper is solid. The following are points that would help improve either the clarity or the claims made.

    Regarding the “medical vision-language models for radiology” on page 2, the cited references are models for chest X-rays. If the authors wish to reference models for radiology, a much broader set of references would be required.

    Regarding the chest X-ray images, the paper would be improved by offering details on exactly how these images were prepared for training/inference; this is not explained in enough detail. For example, which DICOM tags were used to prepare the pixel data?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Minor: These are listed above.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minor: These are listed above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this research presents a promising approach for efficient and practical adaptation of medical VLMs in few-shot settings. The proposed LP+text method offers a compelling alternative to existing techniques, overcoming limitations through the presented experiments.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a new benchmark for few-shot adaptation of medical vision-language models, covering three different modalities, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. Additionally, it proposes a generalization of linear probing, which adapts the last-layer weights (visual prototypes) of a pretrained vision encoder to a downstream task. This extension introduces learnable class-wise multipliers, which are optimized to blend visual prototypes and text embeddings effectively to maximize downstream task performance. The method adjusts the multipliers and visual class prototypes during training, keeping the vision and text embeddings fixed. This allows for model adaptation in a black-box setting, where only access to output embeddings is required. Empirical experiments demonstrate the potential of the proposed method.
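
To illustrate the black-box aspect described above, the sketch below adapts a probe using only precomputed support-set embeddings (reusing the illustrative TextInformedLinearProbe class from the abstract section); the training loop and hyper-parameters are assumptions, not the authors' actual recipe.

```python
import torch
import torch.nn.functional as F


def adapt_black_box(probe, support_emb, support_labels, epochs=100, lr=1e-2):
    """Illustrative black-box adaptation: only precomputed output embeddings are
    needed; encoder weights are never accessed or updated."""
    # support_emb: (N, D) frozen image embeddings of the few labelled shots
    # support_labels: (N,) integer class labels
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        logits = probe(support_emb)                      # touches embeddings only
        loss = F.cross_entropy(logits, support_labels)   # standard few-shot objective
        loss.backward()
        opt.step()
    return probe
```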

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Timely Topic and Benchmark Utility: The efficient adaptation of foundation models in medical imaging is a timely topic, and the introduced benchmark could greatly benefit researchers in the MICCAI community. This is particularly valuable as it includes a variety of medical imaging modalities including Histology, Radiology, and Ophthalmology.

    2. Code Release: The released code can facilitate research in foundation models beyond few-shot adaptation. I am particularly excited that the authors chose to share their dataset preparation code in addition to the model construction and training scripts.

    3. Simple and Effective Method: The proposed method is straightforward yet effective, demonstrating high training efficiency when compared to well-established baselines.

    4. Evaluations are averaged across five random seeds, increasing the reliability of the results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Questionable Realism of Few-Shot Settings: The paper claims to advance realistic evaluations of medical vision-language models (VLMs). However, the practical relevance of very low-shot settings, such as 1-shot, 2-shot, or even 4-shot adaptations, in real-world medical imaging is debatable. It seems plausible that many medical institutions could invest in collecting a slightly larger number of samples to train a more effective model, especially when accuracy is otherwise below a certain threshold. While I understand that up to 16-shot settings are common in the computer vision literature, it would be beneficial for the authors to specify which of these setups are realistic in real-world medical scenarios. For example, can the authors provide concrete scenarios where 16-shot tasks could arise in real-world medical imaging?

    2. Unclear Improvement Over Linear Probing in Ophthalmology: The improvement of the proposed method over linear probing (LP) in Ophthalmology is not clear, with LP even outperforming the proposed method in the 16-shot scenario. Although LP shows significant performance drops at S=1, the practicality of such a setting in medical contexts is questionable and appears to be more academically interesting than clinically relevant. It would be interesting to know the authors’ insights on why LP performs significantly better in Ophthalmology, and specifically in the FIVES dataset.

    3. Claims of Efficiency Need Clarification: When the paper makes statements about efficiency, such as “the most efficient method,” the authors should clarify that they are referring to training efficiency. An argument about why training time should be a priority for medical institutions in few-shot scenarios would be helpful, especially since the training duration for such models is generally short. In these cases, achieving higher performance might be more critical than optimizing training time.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The code seems very clean and extendable!

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Here are some additional comments to help the authors:

    Future and Relevant Work: I agree with the authors’ intuition that a limitation of linear probing (LP) is its failure to utilize information from the text encoder. Notably, there is highly related work that acknowledged this issue and proposed a simple approach for few-shot multimodal adaptation (vision-text) to enhance unimodal (vision-only) tasks [A]. As potential future work, exploring the benefits of this method within the proposed benchmarks could be valuable.

    Attribution of Linear Probing: When linear probing is first mentioned, it is attributed to the CLIP paper [19]. However, the practice of training a linear classifier on top of a feature extractor has been a standard in computer vision for some time and should not be credited to the CLIP paper, as this could mislead readers into thinking it originated there. For instance, the “Feature Transfer” section in the “Exploring the Limits of Weakly Supervised Pretraining” [B] paper discusses similar concepts. If the authors wish to emphasize that they are adopting the same linear probing settings as in [19], it would be more accurate to state this explicitly rather than suggesting that linear probing was “evaluated initially in the seminal CLIP paper [19]”.

    Unsure about the Paper’s Subtitle: Across various studied scenarios, particularly in radiology, all methods—including the proposed method, linear probing (LP), and prompting approaches—perform below expectations (achieving less than 65% accuracy even in 16-shot setups). This undermines the claim in the paper’s subtitle that “A Good Linear Probe Is All You Need.” Instead, it appears that all the methods currently fall short in effectiveness on the proposed benchmark.

    [A] Cross-Modal Few-Shot Learning with Multimodal Models, CVPR 2023
    [B] Exploring the Limits of Weakly Supervised Pretraining, ECCV 2018

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main factors behind my accept recommendation are the paper’s introduction of a comprehensive new benchmark tailored for few-shot adaptation of medical foundation (vision-language) models, which could foster research in new useful methodologies. The proposed generalized linear probing method significantly innovates on existing adaptation techniques by introducing class-wise multipliers that enhance model adaptability, and it shows promising results. Moreover, the authors are sharing a well-structured codebase that will ensure reproducibility and facilitate further research.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available; this was an early-accepted paper.


