Abstract
Recent advancements in Contrastive Language-Image Pre-training (CLIP) have demonstrated notable success in self-supervised representation learning across various tasks. However, existing CLIP-like approaches often demand extensive GPU resources and prolonged training times due to the considerable size of the model and dataset, making them ill-suited for medical applications, in which large datasets are not always available. Meanwhile, the language model prompts are mainly manually derived from labels tied to images, potentially overlooking the richness of information within training samples. We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT) that harnesses the strengths of extensive pre-trained language and visual models. Furthermore, we present an efficient strategy for learning context-based prompts that mitigates the gap between informative clinical diagnostic data and simple class labels. Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets compared with various baselines. The proposed parameter-efficient framework reduces the total trainable model size by 39% and reduces the trainable language model to only 4% of the size of the current BERT encoder.
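For readers unfamiliar with the contrastive objective referenced above, the following is a minimal PyTorch sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over paired image/text embeddings; the function name, dimensions, and temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Names and dimensions are illustrative, not the paper's implementation.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) projections from the two encoders."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text; all other pairs are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```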
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0704_paper.pdf
SharedIt Link: pending
SpringerLink (DOI): pending
Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0704_supp.pdf
Link to the Code Repository
Link to the Dataset(s)
N/A
BibTeX
@InProceedings{Du_CLEFT_MICCAI2024,
author = { Du, Yuexi and Chang, Brian and Dvornek, Nicha C.},
title = { { CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
year = {2024},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15012},
month = {October},
page = {pending}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents a novel language-image contrastive learning framework that integrates parameter-efficient fine-tuning (PEFT) and prompt fine-tuning into language-image contrastive learning. The proposed pretraining framework significantly reduces the total number of trainable parameters compared to CLIP and other CLIP variants in the medical domain, with promising results evaluated on two datasets. It also introduces context-based prompts during fine-tuning, enhancing the model’s ability to generalize from clinical diagnostic data to simple class labels.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed framework is evaluated on two datasets across three tasks and compared with various baselines. The paper is well organized, clearly structured, easy to follow, and detailed enough for reproduction. The idea of reducing model size is very timely, as large-scale foundation models consume substantial resources.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
It would be better to compare the proposed model with itself without PEFT (fully tuned BioMedLM + DiNOv2) to see how much performance is sacrificed in the trade-off for reduced parameters. More baselines should be included, such as PubMed CLIP and BioMed CLIP, especially since BioMed CLIP achieved over 75% accuracy in RSNA zero-shot classification. The model architecture in Section 2.1 is described as GPT-2 and ViT; however, in Fig. 2 and Sec. 3.3, it is actually BioMedLM and DiNOv2. These architectures are similar but vary in training data and strategy, which can be confusing. The models and methods used in this paper are reasonable, but they are known in the field.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Do you have any additional comments regarding the paper’s reproducibility?
N/A
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
Add fully tuned BioMedLM + DiNOv2, PubMed CLIP, and BioMed CLIP as baselines for comparison.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Reject — could be rejected, dependent on rebuttal (3)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Lack of strong baseline comparisons; lack of novelty, as the modules in the proposed model are known in the field; limited improvement compared to state-of-the-art baselines, with no significant impact on clinical applications.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Accept — should be accepted, independent of rebuttal (5)
- [Post rebuttal] Please justify your decision
My concerns were addressed with reasonable explanations.
Review #2
- Please describe the contribution of the paper
To address challenges such as data scarcity and the need for high computing resources, this paper proposes the CLEFT framework, which combines an efficient large language model (LLM) with dynamic prompt fine-tuning to enhance the performance of medical image analysis. By fine-tuning only a small set of model parameters and using a context-based approach for prompt generation, CLEFT achieves significant improvements in parameter efficiency and adaptability to various medical datasets.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The utilization of a billion-parameter LLM, adapted through parameter-efficient fine-tuning techniques, allows CLEFT to enhance the text encoder’s performance without substantial computational cost.
- Context-based prompt tuning helps bridge the gap between generic language understanding and specific medical imaging tasks.
- Extensive experiments have been conducted to validate CLEFT across multiple chest X-ray and mammography datasets, demonstrating superior performance over existing methods in accuracy and efficiency.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- While this approach significantly reduces the computational load, the complexity of integrating multiple techniques (e.g., LLM, PEFT, and dynamic prompts) might pose challenges in terms of implementation and tuning.
- In the motivation, it is mentioned that small dataset size is a common problem in medical image analysis. However, there is no experiment discussing the data efficiency of the model.
- The experiments seem unfair, as the backbones and initializations differ from those of the baselines shown in Table 1. Regarding the backbone, the proposed method is based on ViT, while the others are based on ResNet50 or Swin Transformer. Regarding initialization, the proposed method is based on BioMedLM-3B and DiNOv2, while the others are not.
- It is suggested to discuss the interpretability and explainability of the model, for example by showing the activation of regions of interest in the original image through Grad-CAM or other techniques.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
N/A
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
See section 6 “main weaknesses of the paper”
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Accept — could be accepted, dependent on rebuttal (4)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The efficiency and improved performance seem to be promising.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Weak Accept — could be accepted, dependent on rebuttal (4)
- [Post rebuttal] Please justify your decision
Keep previous score.
Review #3
- Please describe the contribution of the paper
Modifying the well-known CLIP architecture to incorporate medical LLMs. Learning context-based prompts with prompt fine-tuning.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
State-of-the-art performance of the proposed method is demonstrated in comparison to existing baselines. The paper and experiments are technically thorough. An ablation study analyses the individual method components and their contributions. Code will be made publicly available.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Lack of clear method explanation: as per my understanding, the authors claim that they propose a PEFT method; however, they use the LoRA method and report their SOTA results based on it. Please clarify the approach.
- Limited ablation experiments: although the authors present some ablation experiments, more diverse ones are needed, for example on the usage of GPT-2, BioMedLM… How did the authors come to the idea of using these networks?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Do you have any additional comments regarding the paper’s reproducibility?
No comments.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
Please see section 6.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Accept — could be accepted, dependent on rebuttal (4)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see section 5,6.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Weak Accept — could be accepted, dependent on rebuttal (4)
- [Post rebuttal] Please justify your decision
The authors addressed several of my concerns, which led me to lean towards a weak acceptance of the manuscript.
Author Feedback
We thank the reviewers for noting our strengths (proposed model’s efficiency while incorporating LLM, extensive experiments with superior results, clear paper) and respond to critiques/misunderstandings.
Implementation challenges (R1): We agree usability is critical, hence “All code and pre-trained models will be made publicly available on GitHub” (footnote, p.1) and we will include detailed training settings in the online documentation.
Dataset efficiency (R1): Please note dataset efficiency (finetuning with only 1% and 10% of the data) results were shown in Supplement Tables S1 and S2 and discussed on p.7: “Notably, our model also outperforms the baselines even with much less training data. We highlight that for RSNA, our accuracy drops by < 3% when using 1% compared to 100% of the training data.” As seen in Tables S1 and S2, ours performed best under the 1% and 10% data settings for both chest datasets.
Baselines use different backbones/initializations (R1): SOTA baselines are from published studies that use the same datasets as ours, hence the different architectures. These prior works used smaller architectures, but also initialized with pretrained models (e.g., BioClinicalBERT, DeiT). Introducing a LARGE-scale pretrained model with PEFT is our main contribution, with which, even with fewer trainable parameters than the baselines (Table S3), we show better performance (Fig. 1, Table 1).
Model explainability (R1): We will apply model explanation methods such as Grad-CAM and show examples in the Supplement.
Compare fully tuned BioMedLM+DiNOv2 (R3): Please note this comparison was already shown in Table 3 (bottom row, only 0.29% better on CheXpert) and discussed on p.8: “The fully fine-tuned model improves the performance even with a much smaller batch size; however, this improvement comes with the cost of ∼4 times more GPU memory cost and only 1/20 batch size with 2 times longer training.”
Include PubMed CLIP, BioMed CLIP baselines (R3): Please note PubMed CLIP and BioMed CLIP are general vanilla CLIP models, but trained with much larger datasets derived from PubMed, covering multiple imaging modalities and anatomy targets. Our experiments focus on chest-specific tasks and are only trained with the CheXpert dataset. Following previous work (e.g., [12,26,27,29]), we only compare with methods trained on chest data, since comparison with baselines trained on more and different data would be unfair.
Model architecture confusion (R3): In Sec. 2.1, we described the model architecture, i.e., GPT-2 and ViT. In Sec. 3.3, we described the model implementation, noting “We choose BioMedLM-3B and DiNOv2 to initialize our encoders” (p.6). Fig. 2 is labeled with the implemented encoders.
Modules are known to the field (R3): While the individual components are adapted from prior work, we are the first to introduce LLM efficiently to this task. Our study shows a promising path of combining LLM with medical image pretraining when data and computational resources are limited, which is critical in the medical imaging domain. Also, our validation and superior performance on large open datasets (CheXpert, RSNA, EMBED) provide new baselines against which future work can be compared.
Proposed PEFT/LoRA confusion (R4): We clarify here that we did not propose a novel PEFT method, but rather “we introduce the parameter-efficient fine-tuning (PEFT) module to the frozen LLM” (p.4), where “For the PEFT module, we experiment with LoRA [11], IA3 [16], and prefix fine-tuning [15]” (p.5). This benchmarks different PEFT methods in the medical domain and shows LoRA performed best (Table 1).
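As a concrete illustration of attaching a PEFT adapter such as LoRA to a frozen GPT-2-style text encoder, a minimal sketch using the HuggingFace peft library is shown below; the rank, scaling, dropout, and target-module choices are illustrative assumptions, not the paper's reported configuration, and plain GPT-2 stands in for a BioMedLM-scale model.

```python
# Minimal sketch: LoRA adapter on a frozen GPT-2 text encoder via HuggingFace `peft`.
# Hyperparameters and module names are illustrative assumptions.
from transformers import GPT2Model
from peft import LoraConfig, get_peft_model

base = GPT2Model.from_pretrained("gpt2")  # stand-in for a BioMedLM-scale LLM

lora_cfg = LoraConfig(
    r=16,                      # low-rank update dimension
    lora_alpha=32,             # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    bias="none",
)

model = get_peft_model(base, lora_cfg)  # base weights stay frozen; only LoRA weights train
model.print_trainable_parameters()      # reports the small fraction of trainable parameters
```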
Ablation/Why use GPT2, BioMedLM (R4): We will evaluate the influence of different backbones in future work. Our goal here was to utilize large-scale pretrained models, which can help reduce the need for large training datasets and improve performance (Sec 1). Specifically, we chose BioMedLM as it was the best-performing 3B-level LLM according to multiple benchmarks (Chen et al., “MediTron-70B,” arXiv 2023).
Meta-Review
Meta-review #1
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A