Abstract

While Large Language Models (LLMs) excel at world knowledge understanding, adapting them to specific subfields requires precise adjustments. Due to the models’ vast scale, traditional global fine-tuning methods can be computationally expensive and can harm generalization. To address this challenge, a range of innovative Parameter-Efficient Fine-Tuning (PEFT) methods have emerged and achieved remarkable success in both LLMs and Large Vision-Language Models (LVLMs). In the medical domain, fine-tuning a medical Vision-Language Pretrained (VLP) model is essential for adapting it to specific tasks. Can the fine-tuning methods for large models be transferred to the medical field to enhance transfer learning efficiency? In this paper, we delve into the fine-tuning methods of LLMs and conduct extensive experiments to investigate the impact of these methods on existing multimodal models in the medical domain, at both the training-data level and the model-structure level. We show that fine-tuning methods for large models affect medical VLMs differently, and we identify the most efficient ways to fine-tune medical VLP models. We hope this research can guide medical-domain researchers in optimizing the training costs of VLMs, fostering the broader application of VLMs in healthcare. The code and dataset have been released at https://github.com/TIMMY-CHAN/MILE.
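To ground the PEFT methods mentioned above, the following is a minimal LoRA sketch in PyTorch. It is an illustrative sketch of the generic technique, not the paper’s MILE implementation; the `rank` and `alpha` defaults are hypothetical choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update.

    Illustrative sketch of generic LoRA, not the paper's MILE code. Only
    lora_A and lora_B receive gradients, cutting the trainable parameter
    count from d_out * d_in to rank * (d_in + d_out).
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        # delta_W = B @ A; B starts at zero, so training begins at the base model
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus scaled low-rank update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

In practice such a wrapper replaces selected projection layers of a frozen backbone, and only the low-rank factors (and perhaps a task head) are optimized.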

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0640_paper.pdf

SharedIt Link: https://rdcu.be/dV17a

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72086-4_11

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0640_supp.pdf

Link to the Code Repository

https://github.com/TIMMY-CHAN/MILE

Link to the Dataset(s)

https://github.com/TIMMY-CHAN/MILE

BibTex

@InProceedings{Che_Can_MICCAI2024,
        author = { Chen, Jiawei and Jiang, Yue and Yang, Dingkang and Li, Mingcheng and Wei, Jinjie and Qian, Ziyun and Zhang, Lihua},
        title = { { Can LLMs’ Tuning Methods Work in Medical Multimodal Domain? } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {112 -- 122}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper explores whether tuning methods developed for large language models are effective for medical vision-language models. Through extensive experimentation, it demonstrates that certain PEFT techniques, such as LoRA and Prefix-Tuning, adapt well to the medical domain, while instruction-tuning methods for basic VLMs on practical tasks are less effective there.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Conducts an extensive evaluation of PEFT methods in the medical domain. Investigates how instruction-tuning affects the fine-tuning of basic VLP models and points out an effective way to maximize performance gains.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    As a systematic investigation, the use of only one baseline model (MISS) may not be sufficient to draw generalizable conclusions. The experimental results focus primarily on loss and accuracy (ACC); additional metrics should be used for closed-ended QA, and qualitative evaluation is lacking for open-ended QA.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Include additional baseline models with different image and text encoder architectures to enhance the robustness and generalizability of the findings. Add metrics such as Precision, Recall, and F1-Score for closed-ended QA, and qualitative evaluations for open-ended QA.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Insufficient experiments lead to limited conclusions for a systematic review paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My concerns were addressed with reasonable explanations.



Review #2

  • Please describe the contribution of the paper

    This paper investigates efficient fine-tuning strategies for Large Vision-Language Models (LVLMs) in the medical multimodal domain. Introducing a design called Modularized medIcal Vision-Language fine-tuning modEl (MILE), the authors demonstrate that the parameters of the visual encoder are pivotal for enhancing the performance of VLMs. The study highlights the efficacy of LoRA-Tuning and Prefix-Tuning, which deliver results comparable to globally fine-tuned models while reducing training costs by 40%.
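    To put such a training-cost figure in context, the parameter budget of a PEFT setup is often summarized as the fraction of weights that receive gradients. The helper below is a minimal hypothetical sketch, not code from the paper; wall-clock training cost also depends on activations, optimizer state, and batch size.

```python
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that receive gradients (the PEFT budget)."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total if total else 0.0
```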

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper tackles the significant challenge of adapting large foundational models to medical applications, addressing a critical issue in the field.
    2. The paper is well-organized with a logical flow that facilitates understanding. The visualizations, including tables and figures, are clearly presented and effective in explaining key points.
    3. The authors provide a thorough ablation study of various fine-tuning methods, offering a detailed comparison of their impacts on model performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the paper provides valuable insights into efficient tuning strategies, it falls short in guiding future research. There are no clear suggestions or frameworks laid out for applying these insights to further innovations or studies.
    2. The exclusive use of the medical VLM, MISS, as the baseline model restricts the generalizability of the findings. Including popular general-domain LVLMs such as LLaVA and BLIP-2 could have provided a more robust validation of the proposed strategies across different model architectures, better aligning the study with its stated goals.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although this paper provides some insights on fine-tuning current LVLMs in the medical domain, it still misses part of the investigation: how general-domain LVLMs adapt to the medical domain.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper develops methods for fine-tuning large vision-language models in a parameter-efficient way for medical image analysis. The authors studied different parameter-efficient fine-tuning methods (LoRA, adapter tuning, prefix tuning) on one backbone model (MISS, a generative medical VLM). Experiments demonstrated that a VLM can be fine-tuned for downstream tasks with reduced parameter and data requirements.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper include:

    • exploring many different parameter-efficient fine-tuning methods;
    • studying the impact of data-level fine-tuning and the effectiveness of PEFT methods;
    • providing strong evaluations for the proposed method;
    • clearly explaining and discussing the results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    One main weakness is the limited comparison of model performance, only with variants of the proposed model. It’s unclear if the model outperforms existing work. For instance, in Table 5, different types of models are listed without clear comparison. Results of other models on the curated dataset aren’t shown. Application of the model to a downstream task isn’t evaluated. Lastly, the effectiveness of instruction tuning isn’t clear; it only surpasses other models when combined with global fine-tuning.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors discuss a unified approach for fine-tuning large vision-language models for medical image tasks. They use a generative multimodal VLM as the backbone and implement parameter-efficient fine-tuning methods like LoRA and adapter tuning to reduce trainable parameters. This work is significant as it demonstrates how such methods can be applied in the medical domain, where data is scarce and fine-tuning large models is challenging. The authors analyze various aspects of their approach, including different types of parameter-efficient fine-tuning and mechanisms like global vs. instruction-based fine-tuning.
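    For contrast with LoRA, the sketch below illustrates a simplified prefix-tuning-style layer: trainable prefix vectors are prepended to the attention keys and values while the backbone attention stays frozen. This is a hypothetical simplification (true prefix-tuning injects learned key/value states after the K/V projections), not the paper’s implementation.

```python
import torch
import torch.nn as nn

class PrefixSelfAttention(nn.Module):
    """Frozen self-attention with trainable prefix key/value vectors.

    A simplified prefix-tuning-style sketch, not the paper's MILE code:
    the prefixes here are prepended before the K/V projections, whereas
    true prefix-tuning injects learned post-projection key/value states.
    """

    def __init__(self, embed_dim: int = 256, num_heads: int = 4, prefix_len: int = 10):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        for p in self.attn.parameters():
            p.requires_grad_(False)  # the pretrained attention stays frozen
        # only these prefix vectors are trained
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(x, k, v)  # queries come only from the input tokens
        return out
```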

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper addresses the significant challenge of fine-tuning large visual language models for medical tasks. The authors offer a comprehensive analysis of their model’s various designs and meticulously evaluate its performance. While the comparison with other models is lacking, the proposed approach is sufficiently important and innovative.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Thanks for the authors’ rebuttal; it addressed my concerns.




Author Feedback

Common Question: Why choose only MISS as the baseline model? Common Response: Thanks for the question! In related research, most small-scale medical VLMs have been classification or ranking models. But in real medical application scenarios, there are often no answer candidates available. MISS was the only small-scale generative Med-VLM at the time of writing, showing excellent performance on VQA benchmarks. So, we conducted extensive studies on different PEFT methods (46 experiments, shown in Tables 1-4 and Appendix Tables 1-2) based on MISS, adjusting the model structure and combining PEFT units with each component, thereby proving the effectiveness of our method.

R4: Q1: Baseline. A1: Please see the “Common Response”. Q2: Evaluation Metric. A2: Due to the special requirements of medical tasks, accuracy (ACC) is the ONLY GOLD STANDARD for evaluating models in Med-VQA, which differs from the general domain: incorrect diagnostic results can have catastrophic consequences for patients. Whether for closed-ended or open-ended questions, ACC is the most reliable gold standard. Recent research on Med-VQA uses ACC as the sole evaluation metric, such as the studies accepted by [MICCAI 2022] and [MICCAI 2023] proposing M3AE and MUMC. To maintain consistency with prior work and facilitate comparison, we use ACC as the sole evaluation metric. However, we acknowledge the reviewer’s suggestion to use more metrics. If our paper is accepted, we will include these metrics in the APPENDIX. Q3: Insufficient Experiments? A3: In Sec. 4 and Appendix Sec. 2, we present the results of experiments that use different PEFT methods and model architectures on two datasets, totaling 46 experiments, as shown in Tables 1-4 and Appendix Tables 1-2. As you mentioned in the “Describe the Contribution” section, through extensive experiments, (i) we demonstrated the effectiveness of PEFT methods, as shown in Fig. 2(a); (ii) explored the optimal combination of PEFT and Med-VLMs, as shown in Tables 1-4; and (iii) explored the impact of instruction-format data on fine-tuning Med-VLMs. These extensive experiments contradict the major factor behind your score. Thanks for your review. We hope you will reconsider your evaluation of this study based on the responses provided above.
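For reference, the ACC discussed above is, in this line of Med-VQA work, typically computed as exact-match accuracy over generated answers. The helper below is a hypothetical minimal sketch of that convention, not the paper’s evaluation code; actual scripts may normalize answer strings differently.

```python
def vqa_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy after lowercasing and whitespace normalization.

    A minimal sketch of the ACC convention common in Med-VQA papers;
    assumes one reference answer per question.
    """
    norm = lambda s: " ".join(s.lower().split())
    matches = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return matches / len(references)
```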

R5: Q1: Guidance for Future Research. A1: Thanks for the comments! In Sec. 3, we constructed a unified model, MILE. Based on experiments, we proposed the most effective fine-tuning method, MILE-LoRA, and explored the optimal combination of PEFT units and Med-VLMs. As [R4] noted, we point out an effective way to maximize performance gains. As [R6] noted: “This work is significant as it demonstrates how such methods can be applied in the medical domain, where data is scarce and fine-tuning large models is challenging.” This provides valuable guidance for future research. Q2: Baseline. A2: Please see the “Common Response”. Q3: Need Validation on Popular General-Domain LVLMs (LLaVA & BLIP-2). A3: We would like to clarify that this paper focuses on small-scale Med-VLMs rather than general LVLMs. (i) For many researchers, resources for LVLM fine-tuning are limited, so our pioneering work is of great significance; (ii) recent studies accepted by [MICCAI 2022] and [MICCAI 2023], proposing M3AE and MUMC, have both shown that Med-VLMs must be pre-trained on medical-domain data; otherwise, they perform poorly. So, validating our method on general-domain LVLMs such as LLaVA and BLIP-2 would be meaningless. We sincerely hope that after reading our response, the reviewer will consider raising the score.

R6: Thank you for your approval of and advice on our study! Q1: Effect of Instruction Fine-tuning. A1: In Sec. 4, we showed that fine-tuning Med-VLMs using only instruction data significantly degrades model performance, while such data can improve a globally fine-tuned model that has already converged on the original data, achieving SOTA results. Thank you again!




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal, two reviewers recommended acceptance. As an analytical paper, despite some shortcomings, I believe its strengths outweigh its weaknesses, and it should be accepted.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    After the rebuttal, two reviewers recommended acceptance. As an analytical paper, despite some shortcomings, I believe its strengths outweigh its weaknesses, and it should be accepted.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


