Abstract

As natural image understanding moves towards the pretrain-finetune era, research in pathology imaging is evolving in parallel. Despite the predominant focus on pretraining pathological foundation models, how to adapt such models to downstream tasks remains little explored. For downstream adaptation, we propose the existence of two domain gaps, i.e., the Foundation-Task Gap and the Task-Instance Gap. To mitigate these gaps, we introduce PathoTune, a framework designed to efficiently adapt pathological or even visual foundation models to pathology-specific tasks via multi-modal prompt tuning. The proposed framework leverages Task-specific Visual Prompts and Task-specific Textual Prompts to identify task-relevant features, along with Instance-specific Visual Prompts for encoding single pathological image features. Results across multiple datasets at both the patch level and the WSI level demonstrate its superior performance over single-modality prompt tuning approaches. Notably, PathoTune enables the direct adaptation of natural visual foundation models to pathological tasks, drastically outperforming pathological foundation models adapted with simple linear probing. The code is available at https://github.com/openmedlab/PathoDuet.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1648_paper.pdf

SharedIt Link: https://rdcu.be/dY6iQ

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72083-3_37

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1648_supp.pdf

Link to the Code Repository

https://github.com/openmedlab/PathoDuet

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Lu_PathoTune_MICCAI2024,
        author = { Lu, Jiaxuan and Yan, Fang and Zhang, Xiaofan and Gao, Yue and Zhang, Shaoting},
        title = { { PathoTune: Adapting Visual Foundation Model to Pathological Specialists } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {395--406}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper focuses on the adaptation of pre-trained foundation models to histology patch- and WSI-level classification. Concretely, it addresses both natural-image supervised pre-training and domain-specific self-supervised pre-training. The authors propose PathoTune, which uses additive prompt tuning for parameter-efficient fine-tuning of ViTs. Three prompts are employed: Task-specific Visual Prompts (TVP), Task-specific Textual Prompts (TTP), and Instance-specific Visual Prompts (IVP).
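
    For readers unfamiliar with additive prompt tuning, the sketch below illustrates how the three prompt types could be injected into a frozen ViT. It is a hypothetical reconstruction, not the authors' code: it assumes a timm-style ViT exposing patch_embed, blocks, and norm; all dimensions, token counts, and the first-token pooling choice are illustrative, and positional-embedding handling is omitted for brevity.

```python
# Hypothetical sketch of PathoTune-style multi-modal prompt tuning:
# TVP, TTP, and IVP tokens are prepended to the patch embeddings of a
# frozen ViT, and only the prompts plus the task head are trained.
import torch
import torch.nn as nn

class MultiModalPromptViT(nn.Module):
    def __init__(self, vit, embed_dim=768, text_dim=768, inst_dim=512,
                 n_tvp=10, n_ttp=2, n_ivp=2, num_classes=2):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():  # freeze the foundation model
            p.requires_grad = False
        # Task-specific Visual Prompts: freely learnable tokens.
        self.tvp = nn.Parameter(torch.zeros(1, n_tvp, embed_dim))
        # Task-specific Textual Prompts: projected from a BERT task embedding.
        self.ttp_proj = nn.Linear(text_dim, n_ttp * embed_dim)
        # Instance-specific Visual Prompts: projected from per-image features.
        self.ivp_proj = nn.Linear(inst_dim, n_ivp * embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images, text_emb, inst_feat):
        B = images.size(0)
        x = self.vit.patch_embed(images)                        # (B, N, D)
        ttp = self.ttp_proj(text_emb).view(B, -1, x.size(-1))   # (B, n_ttp, D)
        ivp = self.ivp_proj(inst_feat).view(B, -1, x.size(-1))  # (B, n_ivp, D)
        x = torch.cat([self.tvp.expand(B, -1, -1), ttp, ivp, x], dim=1)
        x = self.vit.norm(self.vit.blocks(x))  # frozen transformer blocks
        return self.head(x[:, 0])              # pool on the first prompt token
```

    In PathoTune the text embedding would come from a text encoder over a task-description template and inst_feat from a small CNN over the same image; both are assumptions here (see the rebuttal for the authors' stated choices).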

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem tackled is relevant given the current rise of medical foundation models.
    • The authors validate the method with different stainings on several datasets and tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Contribution of PathoTune.
    1.1. The efficiency of PathoTune for adapting foundation models relies on the small number of parameters tuned, i.e., peak GPU resource consumption, which is well known from [14,27]. Nevertheless, a lower number of tuned parameters does not necessarily translate into faster convergence during adaptation and lower training times (see [a], Figures 10, 11, or 12), which also largely influence training efficiency.
    1.2. Methodological contributions (TVP, TTP, and IVP). TVP is a straightforward application of Deep Visual Prompt Tuning [14,27]. Thus, the two contributions are TTP and IVP. TTP is based on an embedding from a pre-trained text encoder (BERT) per task, plus a projection layer. I have two concerns with this module: (i) there are no specific details about this projection layer, and (ii) the text template for the embedding contains information about the task at hand, but is BERT able to properly encode such fine-grained, expert, medical information? This effect is not properly validated (a quick probe of this question is sketched after this list). Concerning IVP, the methodological description of theta_{VRM} needs to be more specific. The authors state that it is initialized with the first 4 layers of ResNet18. Are these layers also trained? If not, what is the pre-training data source for ResNet18?
    2. Unclear implementation details. The authors state that “In most experiments, the token number for TVP, TTP, and IVP is set at 10, 2, and 2, respectively”. What do they mean by “in most experiments”? The authors do not explicitly claim to use a validation set to fix these hyperparameters per task.
    3. Other.
    3.1. Missing relevant pathological foundation models. The addressed pathology foundation models are all pre-trained using self-supervised learning. There are recent vision-language foundation models for pathology that have demonstrated promising efficient adaptation, especially using Linear Probing, for example [b, c]. Nevertheless, the authors refer to pathological foundation models only in a general manner.
    3.2. Missing PEFT baselines. Linear Probing on self-supervised pre-trained models might be weak, as found in [24]. Nevertheless, other PEFT methods such as bias tuning (BitFit) [d], LayerNorm affine tuning [e], or LoRA [f] are not included.
    3.3. Results. It appears that the largest improvements come from TVP and IVP. Thus, it would be interesting to check the results in Table 1 using only such prompts (not “multimodal”).
    [a] Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning? ICLR 2024.
    [b] A visual-language foundation model for pathology image analysis using medical Twitter. Nature 2023.
    [c] Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images. CVPR 2023.
    [d] BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. ACL 2021.
    [e] Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs. ICLR 2021.
    [f] LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
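
    The TTP concern in item 1.2 can be probed directly, as suggested: encode a “correct” task template and a deliberately wrong one with vanilla BERT and compare the embeddings. The probe below is hypothetical, with illustrative template wording and checkpoint (neither is from the paper); a cosine similarity near 1.0 would suggest BERT barely distinguishes the expert details.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(text: str) -> torch.Tensor:
    """Return BERT's [CLS] embedding for a task-description template."""
    with torch.no_grad():
        out = bert(**tok(text, return_tensors="pt"))
    return out.last_hidden_state[:, 0]  # shape (1, 768)

# Hypothetical templates: one matching the task, one with a wrong stain/task.
correct = embed("An H&E stained pathological image patch for tumor classification.")
wrong = embed("An IHC stained pathological image patch for gland grading.")
print(torch.cosine_similarity(correct, wrong).item())
```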
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Provide a deeper evaluation of the data- and parameter-efficiency gains of PathoTune.
    • Include a more detailed description of the experimental setting and the strategy for hyper-parameter setting.
    • Evaluate recent domain-specific vision-language models.
    • Evaluate PEFT methods beyond prompt learning.
    • In my opinion, the second part of Section 3.1, i.e., Eq. (1) and Eq. (2), does not add much to the paper, since most of the information is repeated and the mathematical formulation is not employed further.
    • Including average metrics across tasks would help the reader get a fast overall view.
    • The datasets on which the Fig. 4 evaluation is carried out are not specified.
    • Study the effect of the text template in TTP: what happens if you use a wrong stain and task name? Is BERT able to encode such details?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The efficiency of PathoTune compared to Fine-Tuning is not properly evaluated. In addition, there are several methodological choices not properly motivated, missing baselines, and unclear implementation details, which overall motivate my recommendation (see Weaknesses).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I have carefully considered the author’s rebuttal and other reviewers’ comments. In my opinion, after the author’s rebuttal, there are still significant unresolved concerns that motivate maintaining my initial score: reject. I provide a rationale below.

    • PathoTune Efficiency. This issue is especially relevant since fine-tuning (FT) provides consistently better performance (Table 1). No quantitative parameter-, training-, or inference-efficiency analysis is introduced. In addition, PathoTune introduces extra inference-time parameters via the prompts and the IVP backbone (ResNet-18).
    • Missing PEFT baselines (see my initial review for more details). The authors do not clarify my initial concern. The rebuttal does not include a clear rationale for the exclusion of popular PEFT methods apart from prompt-based PEFT. In this regard, I sincerely do not understand what @R5 refers to with “comparison to non-LoRA” methods since this work only includes prompt-based methods. Indeed, LoRA, like other PEFT methods, is not included as a baseline.
    • Unclear contribution of key elements. The text prompt TTP uses task-related information via text templates. No ablation study is introduced to demonstrate that the expert text prompts are meaningful for the target task (e.g., using random text templates). Based on the ablation study of TTP (see Fig. 4), this module provides minor improvements, and most performance gains come from IVP (see Table 1/Fig. 4). Note that TTP is a key element that motivates the multi-modality claims.
    • Missing increasingly popular vision-language foundation models beyond self-supervised pre-trained models (see my initial review).

    For all the above, I kindly disagree with the other reviewers’ recommendations and think that this work, in its current form, is not suitable for publication at MICCAI. Considering that the authors propose a novel PEFT method, it seems inconceivable that parameter, training, and inference efficiency are not evaluated.



Review #2

  • Please describe the contribution of the paper

    This paper introduces PathoTune, a framework that employs multi-modal prompts, including task-specific visual prompts, task-specific textual prompts, and instance-specific visual prompts, to efficiently adapt pre-trained visual or pathological foundation models to downstream pathological tasks by addressing the foundation-task gap and the task-instance gap.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose PathoTune, a novel framework that efficiently adapts foundation models to pathological tasks using multi-modal prompt tuning. The method explicitly identifies two types of domain gaps in model adaptation and addresses them with different prompts. Comprehensive evaluations are conducted on multiple public and private datasets. The method successfully facilitates the direct transfer of natural vision models to pathological tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors did not perform an interpretability analysis of the generated prompts, such as the distribution and correlation of the prompt embedding space, making it difficult to understand the true impact of the prompt mechanism on the model.
    2. The method carries a potential risk of overfitting to the prompts, as they are key model parameters.
    3. The authors do not compare their proposed methods with SOTA pathological diagnostic solutions.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see Point 6.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is interesting to adapt natural vision foundation models for pathological diagnosis, but more experiments and analysis would better demonstrate the effectiveness of the proposed solution.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The lack of an interpretability study still affects the level of this submission, but I think that, after the rebuttal, the current version is above the bar for MICCAI, and this submission will be interesting to MICCAI readers.



Review #3

  • Please describe the contribution of the paper

    The authors propose PathoTune, a framework for efficiently adapting (natural or pathological) foundation models to specific downstream tasks. They propose to use three learnable prompts: dataset-specific visual and textual prompts that guide the model towards the current task, and an instance-specific visual prompt that provides the model with refined image features. The results show that all of the introduced prompts have a positive effect on performance and that the proposed adaptation technique outperforms linear probing and LoRA-based approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper proposes a novel combination of multi-prompts for adapting foundation models in computational pathology.
    • The paper includes detailed and interesting ablation studies showing the effect of the proposed prompt tuning on both natural and pathological foundation models.
    • The results indicate the effectiveness of the method compared to other parameter-efficient fine-tuning approaches.
    • The paper is clearly written and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The technical novelty is limited given the prior works in prompt tuning and domain adaptation. The defined foundation gaps seem to be a renaming of known concepts and are not specific to pathology (FTG seems to describe the general domain gap between pre-training and fine-tuning, while TIG states that every instance in a dataset is different, e.g., lighting variations in natural images).
    • The paper does not provide a comparison to non-LoRA-based SOTA methods in the downstream tasks.
    • The intuition behind the IVP tokens is unclear, why is it so beneficial to give the same visual information in two ways?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • It would be valuable to see a comparison to standard single-task foundation models in the different tasks to better understand the performance level of the proposed method.
    • A methodological comparison to the Prompt-MIL [1] paper would be interesting, as this paper also proposed prompt tuning for pathological image analysis.
 [1] Zhang, Jingwei, et al. “Prompt-mil: Boosting multi-instance learning schemes via task-specific prompt tuning.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.
    • Please make the intuition for the IVP tokens clearer.
    Minor:
    • on page 3, “contrast learning” should be “contrastive learning”
    • In Table 1, is the F1 score of 67.5 on CVI-HE for ImageNet (ViT-S) correct or could it be 17.5? It seems like a huge outlier compared to adding TVP in all other cells.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed mechanism is novel for the field, is intuitive, and seems to be effective, as shown on a large set of tasks. However, the comparison to prior work could be improved.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed my main concerns regarding the comparison to non-LoRA methods and the intuition of the IVP token; therefore, I recommend acceptance.




Author Feedback

Thanks for the valuable comments.

R1, R5: Contribution of PathoTune. Current research on pathological models primarily focuses on the pre-training phase. PathoTune introduces a multi-modal prompt strategy to adapt either a visual foundation model or a pathological foundation model to downstream pathology tasks; it is one of the few approaches focusing on pathological PEFT and the first to address domain gaps using varied types of prompts. The two defined domain gaps are not merely renamed concepts but new ones proposed in this paper for the pretrain-finetune era of pathology modeling, where such gaps are more significant than in natural images. To mitigate these gaps, the paper proposes corresponding prompts following a simple-but-effective principle, outperforms current PEFT-based SOTA methods by a large margin, and provides a new paradigm for downstream pathological applications.

R3, R5: Comparisons with Non-PEFT SOTA Methods. Due to page limitations, we present only the comparisons with PEFT-based SOTA methods in the main paper. We have in fact compared PathoTune with other non-PEFT SOTA methods: PathoTune outperforms current SOTA pathology expert models, achieving 97.5% vs. 95.2% on RJ-Prost, 99.8% vs. 99.2% on NCT, and 97.6% vs. 97.1% on SICAPv2. We will include these results in the supplementary material.

R1: Efficiency of PathoTune. It is true that a reduction in the number of parameters does not directly equate to faster convergence or lower training times. However, this is a concern with all prompt-based PEFT methods. Admittedly, compared to adapters, LoRA, and other PEFT methods, we believe prompts are more extensible: as tokens on the input side, they allow the design of three prompts of different modalities targeting the two domain gaps, and they enable the injection of external domain information (e.g., textual descriptions and encoded visual features) into the network. Regarding efficiency, we will add a figure to the supplementary material showing a precision-parameter comparison between PathoTune and other methods to verify the balance between these two factors.

R1: Implementation Details. Some implementation details had to be streamlined due to page limitations. BERT is used as the encoder for TTP because experiments revealed that existing medical-specific BERT encoders (e.g., BioLinkBERT, BlueBERT) perform even worse than vanilla BERT. The projection layer that follows is a single trainable linear layer. The theta_{VRM} for IVP is trainable and initialized from the first 4 layers of ResNet-18 pretrained on ImageNet (one possible reading is sketched below). For each dataset, we follow the same evaluation criteria as other papers. The optimal combination of TVP, TTP, and IVP token numbers varies, but near-optimal performance is generally achieved with 10, 2, and 2. In the final manuscript, we will provide a more detailed implementation description.
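
One plausible reading of the theta_{VRM} description above is sketched below; the interpretation of “first 4 layers” (the ResNet-18 stem plus its first residual stage), the pooling, and the projection into prompt tokens are assumptions, not the authors’ released implementation. Token counts follow the stated near-optimal 10/2/2 setting.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class VisualRefinementModule(nn.Module):
    """Assumed theta_{VRM}: trainable early ResNet-18 stages producing IVP tokens."""
    def __init__(self, embed_dim=768, n_ivp=2):
        super().__init__()
        rn = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # ImageNet-pretrained
        # "First 4 layers" read here as the stem plus the first residual stage.
        self.stem = nn.Sequential(rn.conv1, rn.bn1, rn.relu, rn.maxpool, rn.layer1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(64, n_ivp * embed_dim)  # layer1 outputs 64 channels
        self.n_ivp, self.embed_dim = n_ivp, embed_dim

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.pool(self.stem(img)).flatten(1)  # (B, 64)
        return self.proj(feat).view(-1, self.n_ivp, self.embed_dim)

ivp_tokens = VisualRefinementModule()(torch.randn(2, 3, 224, 224))  # (2, 2, 768)
```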

R5: Intuition for the IVP tokens. IVP tokens are designed to mitigate the Task-Instance Gap. A simple visual encoder is the ideal way to describe global staining and glandular features. Compared to TTP or TVP, IVP, being derived from the original image, better portrays coarse-grained instance-level features, complementing the fine-grained flattened patches.

R3: Risk of Overfitting to Prompts. It is challenging to assess whether the model is overfitting to the prompts. However, to avoid overfitting to the test data, we employed 4-fold cross-validation for all datasets except BCI.

R1: Results using Single TVP and IVP. The results using a single TVP and IVP can be seen in Fig. 4.

R1: Reference to Other Pathological Foundation Models. We will add the references to these vision-language foundation models in the final version.

R1, R3: Other Additional Experiments. We recognize that additional experiments would strengthen our evaluation. However, due to page limitations and MICCAI guidelines, we could present only the most important ones in the main paper.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper addresses the issue of fine-tuning pre-trained models for downstream tasks. This is a topic of growing importance now that many foundation models for digital pathology have been published, and the methods proposed (vision and text prompt tuning) are of great interest to many in the community at present. Reviewers have commented on the lack of some key ablation experiments; in particular, the authors have not demonstrated that their approach is more resource-efficient than fine-tuning, and this is a major flaw in their argument. The paper is well written although a little vague in some details; hopefully the source code will resolve this issue. The number of ablation experiments that can realistically be presented in a short conference submission is limited, but the validation has been performed on a number of relevant datasets (both public and private), and the topic is of great interest.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a framework for adapting foundation models to pathology applications. The paper is well written and touches on an interesting application. However, the reviewers have mixed recommendations. R1 sticks with the rejection recommendation and provides detailed comments. The other two reviewers, while providing valid constructive feedback, increased their scores after the rebuttal. R1’s comments are valid, and it would be great if the authors could address them in the next version of the paper (even if some must go in the supplement, since including all of them in an 8-page paper will be impossible).



