Abstract

Medical foundation models pre-trained on large-scale datasets have demonstrated powerful and versatile capabilities for various tasks. However, due to the gap between pre-training tasks (or modalities) and downstream tasks (or modalities), as well as real-world computation and speed constraints, it might not be straightforward to apply medical foundation models in downstream scenarios. Previous methods, such as parameter-efficient fine-tuning (PEFT) methods and knowledge distillation (KD) methods, are unable to simultaneously address the task (or modality) inconsistency and achieve personalized lightweight deployment under diverse real-world demands. To address these issues, we propose a novel framework called Reprogramming Distillation (RD). On one hand, RD reprograms the original feature space of the foundation model so that it is more relevant to downstream scenarios, aligning tasks and modalities. On the other hand, through a co-training mechanism and a shared classifier, connections are established between the reprogrammed knowledge and the knowledge of student models, ensuring that the reprogrammed feature space can be smoothly mimicked by student models of different structures. Furthermore, to reduce the randomness under different training conditions, we design a Centered Kernel Alignment (CKA) distillation to promote robust knowledge transfer. Empirically, we show that on extensive datasets, RD consistently achieves superior performance compared with previous PEFT and KD methods. Source code is available at: https://github.com/MediaBrain-SJTU/RD
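For concreteness, the following is a minimal PyTorch sketch of a linear-CKA distillation loss of the kind the abstract refers to. The function names and the choice of the linear kernel are illustrative assumptions; the paper's exact CKA formulation may differ.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear Centered Kernel Alignment between two feature batches.

    x: (n, d_t) teacher features, y: (n, d_s) student features.
    Returns a scalar in [0, 1]; 1 means perfectly aligned representations.
    """
    # Center each feature matrix along the batch dimension.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # HSIC-style alignment of the two Gram structures, normalized.
    cross = torch.linalg.matrix_norm(y.t() @ x, ord='fro') ** 2
    norm_x = torch.linalg.matrix_norm(x.t() @ x, ord='fro')
    norm_y = torch.linalg.matrix_norm(y.t() @ y, ord='fro')
    return cross / (norm_x * norm_y)

def cka_distillation_loss(teacher_feats: torch.Tensor,
                          student_feats: torch.Tensor) -> torch.Tensor:
    # Maximizing alignment is equivalent to minimizing 1 - CKA.
    return 1.0 - linear_cka(teacher_feats, student_feats)
```

Because CKA compares Gram structures rather than raw activations, such a loss is invariant to isotropic scaling and orthogonal rotations of the features, which is consistent with the abstract's claim of robustness to randomness across training conditions.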



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0912_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zho_Reprogramming_MICCAI2024,
        author = { Zhou, Yuhang and Du, Siyuan and Li, Haolin and Yao, Jiangchao and Zhang, Ya and Wang, Yanfeng},
        title = { { Reprogramming Distillation for Medical Foundation Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a framework called reprogramming distillation so foundation models can be fine-tuned more effectively towards downstream tasks. They propose two components: reprogramming and CKA distillation. With these methods they are able to efficiently fine-tune large models and outperform existing methods on the datasets they evaluate on.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The problem being solved in this paper is a very relevant problem in medical imaging.

    • Using CKA in combination with student-model-based alignment seems like an interesting way to fine-tune models. The additional branch of the student model can help to reduce potential bias in the foundation model.
    • Performance of the method is very good. It performs well on competitive public medical datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The results shown in the paper are not presented well. Specifically, the results in Tables 4-6 may be better presented in a plot.
    • While the results convincingly show the better performance compared to previous methods, there is no explanation offered on why the method performs better than earlier methods. It would be good to introduce the strategy behind some of the competing methods and explain why the proposed method outperforms them.
    • The ablation studies need to be more elaborate. As far as I can see, the only result that supports using the CKA component is one data point (Table 2).
    • The method is described too coarsely. No details on the implementation are provided (hyperparameters, GPU setup). This is important for a paper on efficient fine-tuning.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The frozen sign in figure 1 needs to be more visible.
    • Some terminology is not well defined in the paper. For example: how do the authors define a foundation model?
    • It is hard to interpret Figure 2. Is there a better way to convey this information?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Lack of details on implementation /computation details
    • presentation of results
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors introduce a groundbreaking framework known as Reprogramming Distillation (RD). This framework is designed to aid the downstream adaptation of foundational models in clinical contexts. To evaluate their model, the authors utilize three distinct types of medical foundational models and five different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The writing of this paper is pretty good. The authors present their ideas with a high degree of logical coherence.

    2. This work holds significant value in clinical settings.

    3. The experiment has been conducted comprehensively.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The author has neither provided an anonymized link to the source code nor promised to release the source code upon acceptance of the submission.

    2. The author has not provided the corresponding implementation details, such as how some hyperparameters in the method section were set.

    3. The authors have not elaborated on some of the details clearly in the experiments section. More details can be found in the constructive comments.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It would be better for authors to release the code and provide more implementation details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It would be beneficial if the authors could provide more implementation details, such as the weights of various components in the loss function, the initial learning rate for training, and the batch size, among others. Additionally, it would be preferable for the authors to release the code if the work is accepted.

    2. The authors have not clearly explained some of the details. For instance, in Table 2, the authors did not specify on which dataset these experimental results were obtained. In Figure 2, the authors did not explain how these decision boundaries were generated.

    3. The authors mentioned that “different model structures, data distributions, random seeds, etc., may introduce more unnecessary noise and increase uncertainty in feature distillation”. Does the “noise” mentioned here inevitably have a negative impact on feature distillation? The authors are requested to provide a detailed explanation on this matter (or provide relevant papers).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty, writing, and experiments of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The novelty, writing, experiments and the feedback from authors.



Review #3

  • Please describe the contribution of the paper

    This paper introduces Reprogramming Distillation (RD), a novel framework that effectively adapts medical foundation models to downstream tasks by reprogramming their feature spaces and employing a co-training mechanism with Centered Kernel Alignment distillation. RD demonstrates superior performance across various datasets by aligning tasks and modalities and promoting robust knowledge transfer. It significantly outperforms existing methods like PEFT and KD in terms of versatility and effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Method Innovation: The RD framework is an intriguing approach that addresses the limitations of existing adaptation techniques, offering a significant step forward in model reprogramming and knowledge distillation.
    • Comprehensive Experiments: The experiments are thorough, involving multiple datasets and comparative analysis with several state-of-the-art methods, providing solid evidence of the framework’s effectiveness.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Comparison Detailing: While RD outperforms other models like Hint, AT, VID, etc., the paper could benefit from a more detailed explanation on how these models were adapted and trained for the specific tasks in the absence of official training on the test datasets like BUSI and ISIC. This will help readers understand the basis of performance comparisons.
    • Model Differentiation: It would enhance reader comprehension to clearly differentiate between the capabilities and roles of the ‘teacher’ and ‘student’ models within the RD framework. Describing these distinctions can provide a clearer picture of their individual contributions to the overall performance gains observed in the study.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    While the paper effectively showcases the performance gains of Reprogramming Distillation (RD) compared to existing methods, it would greatly benefit from a more detailed discussion on how RD differentiates itself from these methods at a conceptual and technical level.

    Consider expanding the discussion to include a theoretical analysis or a deeper dive into the underlying mechanisms that enable RD to surpass other techniques. This could involve detailing the specific aspects of the reprogramming and co-training mechanisms that contribute to its superiority in aligning feature spaces and facilitating knowledge transfer.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Very solid experiments and original method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all reviewers for the valuable comments and the recognition of our work, which we take seriously to improve the manuscript. The detailed concerns are addressed as follows: R1 (Accept): 1) Comparison detailing. For these KD methods, we freeze the teacher model and directly use its output as supervision for the student model without adaptation, as the foundation model contains general medical knowledge that can guide the student model. 2) Model differentiation. In RD, the teacher model refers to a large model pre-trained on broad datasets. Its role is to provide guidance for the student model. The student model is a lightweight model whose role is to learn downstream tasks better under guidance. 3) Method difference. Conceptually, the difference between RD and KD is that RD considers the task inconsistency, while the difference between RD and PEFT is that RD considers model lightweighting and cross-model knowledge transfer. Technically, RD additionally introduces co-training reprogramming and CKA distillation. 4) Mechanism discussion. Co-training reprogramming encourages teacher and student models to simultaneously adapt to downstream tasks and learn more easily transferable features under the constraint of a shared classifier. CKA distillation focuses on the transfer of core high-level information rather than low-level information that is easily disrupted by noise.
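
As a rough illustration of the co-training mechanism and shared classifier described above, here is a minimal PyTorch sketch. The module names (`reprogram`, `project`), the single-linear-layer design, and the loss weighting `lam` are assumptions for illustration, not the paper's exact architecture; it reuses `cka_distillation_loss` from the sketch after the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoTrainingReprogramming(nn.Module):
    """Couples a frozen teacher with a trainable student via a shared classifier.

    `teacher_dim`/`student_dim` are the backbone feature sizes; the linear
    `reprogram`/`project` layers and `hidden_dim` are illustrative choices.
    """
    def __init__(self, teacher_dim, student_dim, hidden_dim, num_classes):
        super().__init__()
        self.reprogram = nn.Linear(teacher_dim, hidden_dim)  # reprograms teacher features
        self.project = nn.Linear(student_dim, hidden_dim)    # maps student features
        self.shared_classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, teacher_feats, student_feats):
        z_t = self.reprogram(teacher_feats)  # reprogrammed, downstream-aligned space
        z_s = self.project(student_feats)
        # The same classifier supervises both branches, tying the two spaces.
        return self.shared_classifier(z_t), self.shared_classifier(z_s), z_t, z_s

def co_training_step(head, teacher_feats, student_feats, labels, lam=1.0):
    logits_t, logits_s, z_t, z_s = head(teacher_feats, student_feats)
    # Both branches fit the downstream labels; CKA aligns their feature spaces
    # (cka_distillation_loss as defined in the earlier sketch).
    return (F.cross_entropy(logits_t, labels)
            + F.cross_entropy(logits_s, labels)
            + lam * cka_distillation_loss(z_t, z_s))
```

In this sketch the teacher backbone stays frozen; only the reprogramming layer, the student backbone, and the shared classifier would receive gradients, matching the lightweight-deployment goal described in the rebuttal.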

R3 (Weak Accept): 1) We will add a more detailed implementation description, including hyperparameter settings, GPU setup, and the training strategy w.r.t. optimizer, learning rate, and batch size, and promise to release the code upon acceptance. 2) Regarding experimental results, Table 2 is implemented on the BUSI dataset, and Figure 2 is generated following the tool from [1]. We will follow the reviewer’s advice to enrich the necessary details for clarity. 3) Explanation about noise. Noise is inevitably present in KD because the supervision provided by the teacher model cannot be completely accurate, which may mislead the training of the student model, leading to negative transfer. More related discussion can be found in [2]. We will add the relevant explanation and papers to support this statement.

[1] Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective [2] Discrepancy and Uncertainty Aware Denoising Knowledge Distillation for Zero-Shot Cross-Lingual Named Entity Recognition

R4 (Weak Reject): 1) Presentation. We will take your suggestions to comprehensively improve our presentation, including the results and the frozen sign. 2) Performance explanation. The main factors contributing to the performance improvement are co-training reprogramming and CKA distillation. The former encourages teacher and student models to simultaneously adapt to downstream tasks and learn more easily transferable features under the constraint of a shared classifier. The latter focuses on the transfer of core high-level information rather than low-level information that is easily disrupted by noise. 3) More elaborate ablation. We will add more elaborate ablation studies for CKA and co-training reprogramming under different datasets to make the results more convincing. 4) We will add a more detailed implementation description, including hyperparameter settings, GPU setup, and the training strategy w.r.t. optimizer, learning rate, and batch size, and promise to release the code upon acceptance. 5) Terminology definition. A foundation model is defined as a model pre-trained on broad datasets, thereby possessing general knowledge. We will carefully check unclear terminology definitions in our paper. 6) Presentation of Figure 2. Thank you for your suggestions. The purpose of Figure 2 is to explain the improvement in performance from the perspective of decision boundaries. We will provide more descriptions for Figure 2, highlight the core information, and explore some numerically auxiliary ways to convey the information.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces a novel framework, Reprogramming Distillation (RD), which is designed to aid the downstream adaptation of foundation models in clinical contexts. In the evaluation, three foundation models and five different datasets are utilized. The targeted problem of foundation model adaptation is interesting and promising. The writing is clear and easy to understand. Therefore, I would recommend accept.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces Reprogramming Distillation (RD) designed to adapt medical foundation models to downstream tasks by reprogramming their feature spaces and employing a co-training mechanism with Centered Kernel Alignment distillation. Strengths include its innovative approach, high clinical relevance, and comprehensive experimentation. In addition, RD demonstrates superior performance across various datasets, promoting robust knowledge transfer, and outperforming existing methods in versatility and effectiveness. However, weaknesses include insufficient implementation details, unclear presentation of results, and lack of ablation studies. The paper lacks more detailed explanations of model adaptation, hyperparameters, and the impact of noise in feature distillation. Given the strengths and weaknesses of this paper, I would recommend accepting this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A
