Abstract

The success of Large Vision Models (LVMs) is accompanied by vast data volumes, which are prohibitively expensive to collect in medical diagnosis. To address this, recent efforts exploit Parameter-Efficient Fine-Tuning (PEFT), which trains a small number of weights while freezing the rest for knowledge transfer. However, these methods typically assign trainable weights to the same positions in LVMs in a heuristic manner, regardless of task differences, making them suboptimal for professional applications like medical diagnosis. In response, we statistically reveal the nature of sparsity and hybridity during diagnostic-targeted fine-tuning: a small portion of key weights significantly impacts performance, and these key weights are hybrid, comprising both task-specific and task-agnostic parts. Based on this, we propose a novel Sparsity- and Hybridity-inspired Parameter-Efficient Fine-Tuning (SH-PEFT). It selects and trains a small portion of weights based on their importance, which is innovatively estimated by hybridizing task-specific and task-agnostic strategies. Validated on six medical datasets of different modalities, SH-PEFT achieves state-of-the-art accuracy in transferring LVMs to medical diagnosis: by tuning around 0.01% of the weights, it outperforms full-model fine-tuning and performs comparably to models deliberately optimized for specific medical tasks. Extensive experiments demonstrate the effectiveness of each design and reveal the great potential of pre-trained LVM transfer for medical diagnosis.
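A minimal sketch of the selection idea described above, assuming gradient magnitude as the task-specific importance signal and weight magnitude as the task-agnostic one (both proxies, the function name, and the PyTorch framing are illustrative assumptions, not the paper's exact estimators): the two normalized scores are blended, and only the top-k weights are left trainable through a binary mask.

    import torch

    def hybrid_topk_mask(weight, grad, lam=0.5, keep_ratio=0.0001):
        # Task-specific proxy (gradient magnitude) and task-agnostic proxy
        # (weight magnitude), each normalized so the two are comparable.
        i_specific = grad.abs() / (grad.abs().sum() + 1e-12)
        i_agnostic = weight.abs() / (weight.abs().sum() + 1e-12)
        importance = lam * i_specific + (1.0 - lam) * i_agnostic
        # Keep only the top-k most important entries trainable.
        k = max(1, int(keep_ratio * importance.numel()))
        thresh = importance.flatten().topk(k).values.min()
        return (importance >= thresh).float()

    # During fine-tuning, gradients would be gated as grad * mask, so only
    # the selected ~0.01% of weights move while the rest stay frozen.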

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1676_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1676_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Liu_Sparsity_MICCAI2024,
        author = { Liu, Mingyuan and Xu, Lu and Liu, Shengnan and Zhang, Jicong},
        title = { { Sparsity- and Hybridity-Inspired Visual Parameter-Efficient Fine-Tuning for Medical Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces Sparsity- and Hybridity-inspired Parameter Efficient Fine-Tuning (SH-PEFT), a novel method designed to optimize the efficiency of fine-tuning large vision models (LVMs) specifically for medical diagnosis applications. By strategically selecting and training only a tiny fraction of the model’s weights deemed most critical through a combination of task-specific and task-agnostic strategies, SH-PEFT dramatically reduces the need for extensive computational resources. Validated across six different medical datasets, SH-PEFT not only outperforms traditional full model fine-tuning in terms of accuracy but also holds its own against other models tailored for specific medical tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Innovative Approach: The paper presents a unique method that combines task-specific and task-agnostic strategies to determine the most influential weights, addressing a common inefficiency in standard parameter-efficient fine-tuning practices.
    2. Highly Efficient: The efficiency of SH-PEFT is commendable, achieving significant improvements in model performance by adjusting only about 0.01% of the total weights.
    3. Strong Empirical Results: The method is extensively tested across multiple datasets, demonstrating superior performance over traditional fine-tuning methods and comparable outcomes to specialized models.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. I have a question about Fig. 3. After obtaining the task-specific and task-agnostic key weights respectively, why is the intersection of the two directly selected as the weights that need tuning? Would it be possible to take a union here?
    2. There are no in-text links for references, figures, tables, and equations.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    /

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the strength section.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces a Parameter-Efficient Fine-Tuning (PEFT) method called Sparsity- and Hybridity-inspired Visual Parameter-Efficient Fine-Tuning (SH-PEFT) for adapting large vision models to medical diagnosis tasks. SH-PEFT effectively selects and trains a small subset of crucial weights within a pre-trained model based on their estimated importance, which is determined through a hybrid strategy that considers both task-specific and task-agnostic contributions. This approach is demonstrated to achieve superior performance on several medical datasets compared to existing PEFT methods, using only a minimal fraction of trainable parameters.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Uniquely blends task-specific and task-agnostic strategies to estimate the importance of weights. This allows for selective tuning that is more suited to the task at hand, particularly in professional applications like medical diagnosis.
    2. The method is grounded in a thorough statistical analysis that identifies key weights that significantly impact performance. This analysis supports the sparsity and hybridity approach, guiding the efficient selection of weights to be tuned.
    3. SH-PEFT demonstrates state-of-the-art performance on multiple medical datasets while tuning only about 0.01% of the model’s weights. This represents a significant reduction in computational resources and training time compared to full model fine-tuning.
    4. Tested and validated across six diverse medical datasets, showing its adaptability and robustness across different types of medical diagnostic tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper does not provide access to the implementation code, which can limit the reproducibility of the results and hinder the ability for others to verify and build upon the work.
    2. The research only utilizes one pre-trained large vision model, CLIP, for all experiments. This reliance on a single model type may not fully demonstrate the generalizability of the SH-PEFT method across different large vision models, which could behave differently under the same fine-tuning strategy.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Providing access to the implementation code would greatly enhance the reproducibility of your results.
    2. While your results with the CLIP model are impressive, testing SH-PEFT across a variety of pre-trained large vision models could strengthen your claims about its generalizability and effectiveness.
    3. You could improve the explainability of your model by applying techniques such as LIME or SHAP to clearly understand how different features or parts of the model influence its predictions.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces an innovative fine-tuning method, SH-PEFT, which demonstrates efficient parameter usage and potential for medical diagnostics. However, my reservation stems from the lack of provided implementation code and the study’s reliance on a single model, limiting reproducibility and generalizability.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This article proposes a new approach to fine-tuning a Vision Transformer (ViT) type neural network by training only a very small portion of the model’s parameters. The authors highlight the statistical phenomena of sparsity (only a limited number of weights differs significantly between a pretrained model and a finetuned model) and hybridity (most of these weights are not the same across tasks, but a small fraction remains common). The latter observation allows the authors to propose a weight selection combining a task-agnostic and a task-specific criterion. By doing so, even though only the small set of selected weights is retrained, the model achieves better performance than all the compared baselines. Six datasets from very different tasks are used to assess the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This article is enjoyable to read, both in content and style, clearly presenting its motivations, methodology, and results. Figure 2 remarkably summarizes the preliminary results obtained, justifying the proposed methodology. The results, while part of a rich literature on the subject (well introduced by the authors), are interesting: by training only 1% of the model’s weights, the authors achieve better performance than a fully finetuned model. The weight selection technique also demonstrates its effectiveness compared to similar approaches from the very recent state of the art.

    The experimental results are very conclusive, thanks to a well-chosen use of very diverse data and a relevant comparison with recent models. The ablation study clearly highlights the relevance of the different choices made by the authors in their algorithm.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The novelty is relatively limited compared to reference [11], but the methodological contribution remains relevant nonetheless.
    • Only one architecture is tested, and it is of modest size, whereas the introduction mentions “Large Vision Models”, suggesting a broader spectrum of application. It would be interesting to check whether the results obtained extend to other architectures (or even to CNN models).
    • The authors train their model using SGD, whereas solvers like Adam, Adagrad, etc., exist which offer dynamic, weight-specific learning rates. In a very simplified way, the learning rate of each weight is adjusted based on its previously obtained gradients. In this sense, one can see such a solver as a more general version of Equation 2, in which M_{m,n} would be non-binary and dynamic; after all, M_{m,n} is also built from the magnitude of previous gradients. Can you evaluate and comment on the interest of the SH-PEFT approach in comparison to these solvers? (See the sketch after this list.)
    • The demonstration of the performance achieved with the method is convincing in my opinion, but it also raises a number of questions about the approach. In particular, it would be interesting to mention the speed of convergence (in terms of iterations) when such a small number of parameters is retained. If the small number of parameters kept accelerates the training, it would provide a good justification for why the authors chose such a low threshold of 1% of parameters kept (why not 10%, 20%, or even 50%?).
    • There are minor typos in the manuscript (the abstract mentions 0.01% of weights finetuned versus 1% in the text), and the balancing term of the second term of Equation 1 is incorrect (∑I_{m,n} / ∑I_{m,n}, which trivially equals one), just before the sentence “After estimating the importance of all parameters, we set a threshold […]”.
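    To make the comparison with adaptive solvers concrete, here is a hedged sketch (variable names and framing are ours, not the paper's): a binary-masked SGD step only ever updates the selected weights, whereas an Adam-style step rescales every weight with a continuous, dynamic factor built from gradient history and keeps optimizer state for all weights.

        import torch

        def masked_sgd_step(w, grad, mask, lr=1e-3):
            # Fixed binary mask: only the selected weights are ever updated.
            w -= lr * mask * grad

        def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
            # Adam's per-weight factor is continuous and changes each step,
            # and state (m, v) is stored for all weights, not a sparse subset.
            m.mul_(b1).add_(grad, alpha=1 - b1)
            v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
            w -= lr * (m / (1 - b1 ** t)) / ((v / (1 - b2 ** t)).sqrt() + eps)
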
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I would suggest to release the code upon publication.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Given the space limitation, I do not expect a fully comprehensive evaluation of the approach on different architectures, but I believe it would be interesting to measure how the approach scales with larger models, and I suggest analyzing this in future work.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In my opinion, this is a very interesting contribution, backed by a very solid experimental validation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank the meta-reviewer and the reviewers (R1, R3, R6) for their constructive comments. We are encouraged by their praise for the novelty (R3, R6), the effectiveness of our method (R1, R3, R6), the extensive validation with good performance (R1, R3, R6), and the clear organization of this work (R1). We have carefully studied the comments and respond to them below.

  1. Generalization to other architectures (R1, R6) Thanks for the constructive comments. Given the limited space and the extensive number of experiments required for multi-backbone evaluation, we have not yet verified the generalization of SH-PEFT to other architectures. In future work, we will further refine SH-PEFT and demonstrate its effectiveness across different Large Vision Models (LVMs), including both Transformer- and CNN-based ones.

  2. In Fig. 3, why intersection rather than union? (R3) We would like to clarify that Fig. 3 depicts Formula 1, where the tunable weights are jointly determined by the weight importance estimated by both task-specific and task-agnostic methods. Rather than taking an intersection or a union, a hyperparameter lambda is used to weight the two importance scores of each weight, and the top-k most important weights are then selected for tuning. The impact of lambda is shown in Table 3. We will further polish Fig. 3 for clarity, showing that the selection is based on top-k weight selection rather than intersection.
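  A toy numeric sketch of this point (the scores are made up): the top-k of the lambda-blended importance can differ from both the intersection and the union of the two per-strategy top-k sets.

      import numpy as np

      i_specific = np.array([0.9, 0.6, 0.1])  # task-specific importance
      i_agnostic = np.array([0.1, 0.6, 0.9])  # task-agnostic importance
      lam, k = 0.5, 1

      def topk(scores):
          return set(map(int, np.argsort(scores)[-k:]))

      blended = lam * i_specific + (1 - lam) * i_agnostic
      print(topk(i_specific) & topk(i_agnostic))  # intersection: set()
      print(topk(i_specific) | topk(i_agnostic))  # union: {0, 2}
      print(topk(blended))                        # top-k of the blend: {1}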

  3. Differences with solvers (R1) In comparison with such solvers, SH-PEFT aims to adapt LVMs to downstream tasks in a parameter-efficient manner. Specifically, it offers two main advantages. 1) Reduced storage burden: solvers train the entire model, which requires storing a separate LVM for each downstream task; in contrast, SH-PEFT only records parameter changes at a few key locations, which significantly reduces the storage burden when dealing with multiple downstream tasks (illustrated below). 2) Greater memory efficiency: solvers record gradients for every parameter during training, whereas PEFT solutions, with proper implementation, only record a portion of these gradients. This uses less GPU memory, enabling the fine-tuning of large models with limited resources.
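  As a hedged illustration of the storage point (function and variable names are ours, not from the paper), each downstream task could be stored as just the indices and value changes of the tuned weights, instead of a full fine-tuned copy of the LVM:

      import torch

      def sparse_delta(pretrained, finetuned, mask):
          # Keep only the positions selected for tuning and their changes.
          idx = mask.flatten().nonzero(as_tuple=True)[0]
          delta = (finetuned - pretrained).flatten()[idx]
          return idx, delta  # per-task storage: a tiny fraction of the tensor

      def apply_delta(pretrained, idx, delta):
          # Recover the task-adapted weights from the shared pretrained copy.
          w = pretrained.flatten().clone()
          w[idx] += delta
          return w.view_as(pretrained)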

  4. Comparison to SPT [11] (R1) The distinctions between our SH-PEFT and SPT are discussed in the final paragraph on page 5, including: 1) We extend the scope of weight selection for enhanced flexibility. 2) We contribute a new strategy for estimating weight importance. 3) SH-PEFT performs better when applied independently.

  5. Speed of convergence & number of parameters kept (R1) Maintaining a small number of trainable weights can accelerate the convergence of the model. In our previous experiments, full model fine-tuning continued to improve results from 20k to 40k iterations, but showed no significant improvement from 40k to 60k iterations. As for PEFT methods, most models, including ours, converge before 20k iterations, and subsequent training does not enhance performance. Regarding the proportion of selected trainable weights, we primarily considered comparability with other PEFT methods. As shown in Fig. 4, some existing methods maintain about 1% of the weights; therefore, in our experimental design, we use 1% to demonstrate that our method is more effective under a similar number of trainable weights.

  6. Typos and formatting (R1, R3), code release and future work (R1, R6), model explainability (R6) Thank you for your constructive comments. In the camera-ready version, we will carefully review the article's content, correct typos, and add in-text links according to the MICCAI format. Due to the space limitation of the conference, the model's interpretability has not yet been examined. In future work based on this version, we will add interpretability analyses, further improve our method, and make the code publicly available.




Meta-Review

Meta-review not available, early accepted paper.


