Abstract

Parameter-efficient transfer learning (PETL) is proposed as a cost-effective way to transfer pre-trained models to downstream tasks, avoiding the high cost of updating entire large-scale pre-trained models (LPMs). In this work, we present Fine-grained Prompt Tuning (FPT), a novel PETL method for medical image classification. FPT significantly reduces memory consumption compared to other PETL methods, especially in high-resolution input contexts. To achieve this, we first freeze the weights of the LPM and construct a learnable lightweight side network. The frozen LPM takes high-resolution images as input to extract fine-grained features, while the side network is fed low-resolution images to reduce memory usage. To allow the side network to access pre-trained knowledge, we introduce fine-grained prompts that summarize information from the LPM through a fusion module. Important token selection and preloading techniques are employed to further reduce training cost and memory requirements. We evaluate FPT on four medical datasets with varying sizes, modalities, and complexities. Experimental results demonstrate that FPT achieves comparable performance to fine-tuning the entire LPM while using only 1.8% of the learnable parameters and 13% of the memory costs of an encoder ViT-B model with a 512 x 512 input resolution.
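
To give a rough sense of why feeding the side network low-resolution images saves memory, the sketch below counts patch tokens at two input sizes. It is illustrative arithmetic only: the patch size of 16 and the 224 x 224 low-resolution size are assumptions for the example, not values stated in the abstract, and the helper `num_patch_tokens` is hypothetical.

```python
# Illustrative arithmetic only. Patch size 16 and the 224 x 224 low-resolution
# input are assumptions for this example, not values taken from the abstract.

def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

high_res_tokens = num_patch_tokens(512)  # input size quoted for the frozen LPM
low_res_tokens = num_patch_tokens(224)   # hypothetical side-network input

# Self-attention builds a tokens x tokens score matrix per head, so that part
# of the activation footprint grows roughly quadratically with token count.
attention_ratio = (high_res_tokens ** 2) / (low_res_tokens ** 2)

print(high_res_tokens, low_res_tokens, round(attention_ratio, 1))
```

Under these assumptions the high-resolution branch carries about five times as many tokens as the low-resolution one, which is why keeping gradients out of that branch matters for memory.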

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2750_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/YijinHuang/FPT

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hua_Finegrained_MICCAI2024,
        author = { Huang, Yijin and Cheng, Pujin and Tam, Roger and Tang, Xiaoying},
        title = { { Fine-grained Prompt Tuning: A Parameter and Memory Efficient Transfer Learning Method for High-resolution Medical Image Classification } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a new parameter and memory efficient fine-tuning method Fine-grained Prompt Tuning (FPT), aiming to enhance the effectiveness of parameter-efficient fine-tuning for medical images in high-resolution contexts. Basically, the proposed method attempts to (i) efficiently extract fine-grained information from high-resolution images and (ii) effectively adapt pre-trained knowledge from large-scale pretrained models. By conducting experiments on four public medical image datasets, the paper demonstrates the effectiveness of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well structured with detailed information.

    • The paper provides codes for reimplementation.

    • The paper shows good balance between effectiveness and efficiency.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • My main concern is that I could not find any specific design in the proposed FPT that clearly targets medical imaging, given the content of Section 2. In other words, this work is no different from regular approaches for natural images such as LoRA; it simply conducts its main experiments on medical image datasets.

    • More behavior analysis is needed to better explain the effectiveness of each main component of FPT.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    At a venue like MICCAI, one expects methods that are specifically designed for medical imaging, rather than methods that also happen to work well on medical imaging. I would suggest submitting this work to CV/AI venues after adding more experiments on natural images.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a new parameter-efficient fine-tuning method, Fine-grained Prompt Tuning, to reduce memory consumption on high-resolution images. They propose a lightweight side network to avoid back-propagating gradients to the backbone. They use low-resolution input to enhance the performance. They design a token selection mechanism and a trainable prompt-guided attention mechanism to reduce the number of tokens.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is novel. It uses a side network to avoid back-propagation on the backbone. The mechanism of trainable prompt guided attention reduces the number of tokens and is also very interesting. The memory reduction of the method is also impressive.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. This framework looks like a separate network to gather important features at various resolution/layers to make a prediction. I doubt whether it can be called tuning.
    2. What about fine-tuning the entire LPM and the proposed side network? Will it improve the performance a lot compared to the “Full fine-tuning” baseline?
    3. The authors claim that “this is the first work to enhance the efficiency of fine-tuning in high-resolution settings”. However, [1] proposed prompt tuning for giga-pixel pathology images and also enhanced the efficiency of fine-tuning, though not by much.
    4. The authors mention that they do not use any data augmentation on the input of the frozen LPM. Does this mean that none of the baselines use data augmentation, since they have no low-resolution input? If so, the baselines are trained without augmentation while the proposed method uses augmentation on its low-resolution branch, and the comparison is not fair.
    5. Is 512x512 the best resolution for all these datasets? There should be some baselines on 224x224 inputs.
    6. Why is the memory cost of the proposed method lower than that of linear probing? Feature preloading should also be possible for linear probing.
    7. How fast is this method compared to the baselines?
    8. The authors use trainable fine-grained prompts to query important tokens at every layer. Does this mean that the number of tokens increases as the network goes deeper?
    9. For the LoRA baseline, are LoRA layers added to the q, k, and v projections in all transformer encoder layers?
    10. Models trained with self-supervised methods usually perform better on medical image datasets because the domain gap between the pre-training dataset and the downstream data is limited. [1] also shows that prompt tuning performs better than full fine-tuning in this case. Can the proposed method work well on SSL models?

    [1] Zhang, Jingwei, et al. “Prompt-mil: Boosting multi-instance learning schemes via task-specific prompt tuning.” MICCAI 2023

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As mentioned in the weakness section, I have some concerns on the experiments (especially point 2 and 3). The authors should also justify that this is a tuning method, not a separate network.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is interesting and novel, but the authors need to clarify their experimental setting.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors provided enough explanations in the rebuttal and addressed most of my concerns.



Review #3

  • Please describe the contribution of the paper

    The paper presents Fine-grained Prompt Tuning (FPT), a novel approach to Parameter-efficient Fine-tuning (PEFT) for medical image classification in high-resolution contexts. FPT addresses the challenge of high memory consumption associated with high-resolution medical images by freezing the weights of a large pre-trained model (LPM) and introducing a lightweight side network that operates on lower-resolution images. The method employs fine-grained prompts and a fusion module to effectively adapt pre-trained knowledge from the LPM to the side network, achieving comparable performance to full LPM fine-tuning with significantly reduced parameter and memory requirements.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. FPT introduces a unique combination of asymmetric input resolution and fine-grained prompts, which is a creative solution to the problem of high memory usage in high-resolution image processing.
    2. The method achieves substantial memory savings, using only 13% of the memory costs of a baseline model, while maintaining high performance, which is crucial for practical applications in medical imaging.
    3. The paper presents a thorough evaluation of FPT across four medical datasets, demonstrating its effectiveness and efficiency compared to state-of-the-art PEFT methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The method’s effectiveness is demonstrated primarily in the medical imaging domain, and its applicability to other domains remains to be explored.
    2. The integration of multiple components such as the side network, fine-grained prompts, and fusion module may introduce complexity in understanding and implementing the method.
    3. The performance of FPT is contingent on the quality of the LPM, which may limit its effectiveness if the LPM is not well-suited to the specific medical imaging task.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. Future work could explore the adaptation of FPT to other domains beyond medical imaging to assess its broader applicability.
    2. Efforts could be made to simplify the architecture or the training process to make the method more accessible to practitioners.
    3. Research could focus on reducing the dependence on LPMs by enhancing the side network’s ability to extract relevant features independently.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    FPT introduces a unique combination of asymmetric input resolution and fine-grained prompts. The method achieves substantial memory savings.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

R1&R3: Specific designs targeting medical images; Explore applicability to other domains. Thank you for the insightful comments. While FPT can potentially be adapted to various domains, it specifically targets medical imaging; its merits are particularly highlighted when applied to medical images. We shall analyze its applicability to other domains in future work. The specific designs of FPT targeting medical imaging include:

  1. The Important Token Selection is based on prior medical knowledge: diagnostic clues typically occupy a small proportion of the entire image of interest. For example, lesions in fundus images cover only 2% of an image (the IDRiD dataset). As such, FPT selects important tokens to retain key features and reduce redundancy.
  2. Current popular LPMs are pre-trained mainly on natural image datasets, which may not fully capture the complexity of medical images. Directly incorporating features from LPMs into the side network may yield very limited improvement. Therefore, we design fine-grained fusion modules, enhancing FPT’s ability to leverage non-medical pre-trained knowledge.
  3. FPT is designed to enhance efficiency in high-resolution settings, a typical requirement in medical imaging.
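
As a minimal sketch of the token selection idea in item 1, the snippet below keeps only the highest-scoring fraction of patch tokens. The rebuttal does not say how importance is measured, so the per-token scores here are hypothetical (for instance, attention received from a summary token), and `select_important_tokens` and `keep_ratio` are illustrative names rather than the authors' API.

```python
from typing import List

def select_important_tokens(scores: List[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by descending score, then keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical importance scores for 8 patch tokens: two "lesion" patches
# stand out from a mostly uninformative background.
scores = [0.02, 0.01, 0.40, 0.03, 0.02, 0.35, 0.01, 0.02]
kept = select_important_tokens(scores, keep_ratio=0.25)
print(kept)  # [2, 5]
```

Only the kept indices would be passed on to later stages, which is where the reduction in redundancy and memory would come from under this reading.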

R1: More behavior analysis on components. Thank you for the suggestion. Several design choices are involved in the proposed components. We have analyzed the ratio of token selection in Table 3. Due to space limitations, more analyses will be included in our journal version.

R3: Fusion module may introduce complexity. Thank you for the comment. Simplifying FPT will be a main future research direction.

R3: FPT is contingent on the quality of LPMs. We agree that FPT depends on the quality of LPMs, which nevertheless is common to transfer learning methods. Reducing such dependence is a very good suggestion, and we will explore it in the future.

R4: Can FPT be called tuning? Thank you for pointing out this matter. We will modify the description of FPT to clarify it as parameter-efficient transfer learning.

R4: Fine-tuning the entire FPT. Fine-tuning the entire FPT performs similarly to the full fine-tuning baselines. Due to rebuttal constraints, results will be provided in our journal version.

R4: [1] is proposed for giga-pixel images. Thank you for the comment. We will add this related work in the revised manuscript. [1] makes training on giga-pixel images feasible by splitting the whole image into low-resolution patches, while FPT focuses on enhancing efficiency for high-resolution inputs. For clarification, we will specify our claim in the context of high-resolution inputs in the revised manuscript.

R4: Data augmentation for baselines. We apologize for any confusion. To ensure a fair comparison, all methods were trained employing the same data augmentations as those for low-resolution inputs in FPT.

R4: Is the 512x512 the best resolution? Higher resolution improves performance, but it gradually saturates. The 512x512 resolution is a balanced choice. More analyses will be provided in our journal version.

R4: Preloading for linear probing. Important token selection with feature preloading reduces memory usage. While preloading can be used for linear probing, it requires no augmentation, which would degrade the linear probing performance. For fairness, we used data augmentation for all methods, and thus did not employ preloading.
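
The constraint mentioned here, that preloading only works without augmentation, can be illustrated with a small cache: features from a frozen extractor can be stored once and reused across epochs only if the extractor sees the same input every time. This is a sketch under that assumption; `PreloadedFeatures` and `frozen_extractor` are hypothetical stand-ins, not code from the FPT repository.

```python
from typing import Callable, Dict, List

class PreloadedFeatures:
    """Cache outputs of a frozen extractor, keyed by sample id."""

    def __init__(self, extractor: Callable[[List[float]], List[float]]):
        self.extractor = extractor
        self.cache: Dict[str, List[float]] = {}
        self.extractor_calls = 0

    def get(self, sample_id: str, image: List[float]) -> List[float]:
        # Valid only if `image` is identical every epoch (no augmentation);
        # an augmented input would make the cached features stale.
        if sample_id not in self.cache:
            self.cache[sample_id] = self.extractor(image)
            self.extractor_calls += 1
        return self.cache[sample_id]

def frozen_extractor(image: List[float]) -> List[float]:
    """Stand-in for a frozen pre-trained model (deterministic, no gradients)."""
    return [2.0 * x for x in image]

dataset = {"img_a": [0.1, 0.2], "img_b": [0.3, 0.4]}
store = PreloadedFeatures(frozen_extractor)
for epoch in range(3):
    for sample_id, image in dataset.items():
        features = store.get(sample_id, image)

print(store.extractor_calls)  # 2
```

The extractor runs once per image rather than once per image per epoch, so the frozen model need not stay in memory during training.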

R4: How fast is FPT? FPT is faster than other PEFT methods and only slightly slower than linear probing.

R4: Does the number of tokens increase? We apologize for the confusion. The prompts in the intermediate sequence of the side network are removed after the forward layer, and thus the number of tokens in each layer is the same.
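
The bookkeeping described in this answer can be sketched as follows: prompts are appended to the sequence, the layer runs, and the prompt positions are dropped before the next layer. The layer itself is a placeholder that mixes each token with the sequence mean; it is an assumption for illustration and not the side network's actual block.

```python
from typing import List

Token = List[float]

def stand_in_layer(sequence: List[Token]) -> List[Token]:
    """Placeholder for a transformer layer: mixes each token with the sequence mean."""
    dim = len(sequence[0])
    mean = [sum(tok[d] for tok in sequence) / len(sequence) for d in range(dim)]
    return [[0.5 * tok[d] + 0.5 * mean[d] for d in range(dim)] for tok in sequence]

def layer_with_prompts(tokens: List[Token], prompts: List[Token]) -> List[Token]:
    """Append prompts, run the layer, then drop the prompt positions."""
    n = len(tokens)
    mixed = stand_in_layer(tokens + prompts)
    return mixed[:n]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One hypothetical prompt per layer, for three layers.
prompts_per_layer = [[[0.5, 0.5]], [[0.2, 0.8]], [[0.9, 0.1]]]

lengths = []
for prompts in prompts_per_layer:
    tokens = layer_with_prompts(tokens, prompts)
    lengths.append(len(tokens))

print(lengths)  # [3, 3, 3]
```

The prompts still influence the kept tokens through the mixing step, but the sequence length entering each layer does not grow with depth.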

R4: LoRA details. We added LoRA layers to q and v.
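
For context on what adapting only q and v costs in parameters, the arithmetic below counts the trainable LoRA weights for a ViT-B-sized encoder. The hidden size of 768 and depth of 12 are the standard ViT-B configuration; the rank of 8 is an assumption, since the rebuttal does not state it, and q and v are treated as separate 768 x 768 projections for simplicity.

```python
# Illustrative parameter count. HIDDEN_DIM and NUM_LAYERS follow the standard
# ViT-B configuration; RANK is a hypothetical value not given in the rebuttal.
HIDDEN_DIM = 768
NUM_LAYERS = 12
RANK = 8
ADAPTED_PROJECTIONS = ("q", "v")

def lora_params_per_projection(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

per_projection = lora_params_per_projection(HIDDEN_DIM, HIDDEN_DIM, RANK)
total_lora_params = per_projection * len(ADAPTED_PROJECTIONS) * NUM_LAYERS

print(per_projection, total_lora_params)  # 12288 294912
```

Adding k under the same assumptions would raise the total by half again, which is one reason the q-and-v-only setup is a common baseline choice.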

R4: Can FPT work well on SSL models? FPT can adapt to different pre-trained LPMs, including SSL models. Due to rebuttal constraints, we shall provide more analyses on LPM’s adaptability in our journal version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received two “Accept” and one “Weak Reject.” I disagree with R3’s (Weak Reject) justification that the proposed method must be specifically designed for medical imaging. A method initially developed for medical imaging that is also effective for natural imaging is commendable for the MICCAI community. Since R3 did not provide a final decision, I gave his rating lower priority. I recognize the paper’s contribution and suggest an “Accept.”




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents Fine-grained Prompt Tuning, a novel and parameter-efficient fine-tuning method aimed at reducing memory consumption in high-resolution medical images. Strengths noted by reviewers include the innovative approach, substantial memory savings (using only 13% of the memory of a baseline model), detailed presentation, thorough evaluation across four medical datasets, and provision of reimplementation codes. While some weaknesses were identified, such as limited demonstration outside medical imaging, complexity in understanding the method, and dependency on the quality of the large pre-trained model, the well-organized structure and significant memory savings outweighed the weaknesses identified. Thus I suggest accepting this paper.



