Abstract

In recent years, prompting pre-trained visual-language (VL) models has shown excellent generalization to various downstream tasks in both natural and medical images. However, VL models are sensitive to the choice of input text prompts, requiring careful selection of templates. Moreover, prompt tuning in the weakly supervised/multiple-instance learning (MIL) setting is fairly under-explored, especially in computational pathology. In this work, we present a novel prompt tuning framework that leverages frozen VL encoders with (i) residual visual feature adaptation and (ii) text-based context prompt optimization for whole slide image (WSI)-level tasks, i.e., classification. In contrast to existing approaches that use variants of attention-based instance pooling for slide-level representations, we propose synergistic prompt-based pooling of multiple instances as the weighted sum of learnable-context and slide features. By leveraging the mean learned-prompt vectors and pooled slide features, our design facilitates different slide-level tasks. Extensive experiments on public WSI benchmark datasets reveal significant gains over existing prompting methods, including standard baseline multiple instance learners.
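To make the pooling described above concrete, below is a minimal PyTorch sketch of one plausible reading of the context-driven pooling. No official code was released (see "Link to the Code Repository"), so every name, shape, and the temperature value is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextPooling(nn.Module):
    """Hypothetical sketch: pool the instance features of one WSI using K
    learnable context (prompt) vectors, then blend with their mean."""
    def __init__(self, dim: int, num_ctx: int = 4, tau: float = 0.07):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)  # K learnable vectors
        self.tau = tau  # assumed CLIP-style temperature

    def forward(self, inst_feats: torch.Tensor) -> torch.Tensor:
        # inst_feats: (N, D) frozen-encoder features of N instances in one slide.
        x = F.normalize(inst_feats, dim=-1)
        c = F.normalize(self.ctx, dim=-1)                 # (K, D)
        sim = x @ c.t() / self.tau                        # (N, K) cosine logits
        attn = sim.mean(dim=-1).softmax(dim=0)            # (N,) instance weights
        slide = (attn.unsqueeze(-1) * x).sum(dim=0)       # (D,) pooled slide feature
        # Weighted sum of the pooled slide feature and the mean context vector.
        return F.normalize(slide + c.mean(dim=0), dim=-1)
```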

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2700_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2700_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Chi_LowShot_MICCAI2024,
        author = { Chikontwe, Philip and Kang, Myeongkyun and Luna, Miguel and Nam, Siwoo and Park, Sang Hyun},
        title = { { Low-Shot Prompt Tuning for Multiple Instance Learning based Histology Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a novel prompt tuning framework for few-shot, weakly supervised classification in histology. It enhances the learning of slide-level features through a context-driven instance pooling method. Extensive experiments validate its effectiveness, showing superior performance over existing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The performance of this method shown in Table 1 seems to be competitive.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The organization of this paper is poor, making it hard to follow. The Introduction does not clearly explain what this paper does and why.
    2. Some symbols and formulas are confusing. For example, $A_\psi^v$ is used before it is defined, and Formula 5 describes a softmax function but takes a scalar as its input.
    3. Why use K additional prompt vectors rather than one per class? The effect of the hyperparameter K is not evaluated.
    4. The text in Figure 1 is too small to read clearly, and what each box color represents (except the red box) is not explained.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Some typos: “leveraged used” (possibly a duplication), and two “+Ours” rows in Table 3.
    2. It would be better to reorganize the paper and clearly present the contributions in the Introduction.
    3. Some formulas (e.g., Formula 5) and symbols should be reconsidered.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the weaknesses part.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After reading the authors’ feedback and the other reviewers’ comments, I will raise my score to Weak Accept.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a new prompt tuning framework for pre-trained visual-language (VL) models, designed to enhance performance on whole slide image (WSI) classification tasks in computational pathology. This framework addresses the sensitivity of VL models to text prompt variations by integrating residual visual feature adaptation and text-based context prompt optimization. Unlike traditional methods that utilize attention-based instance pooling for slide-level representations, this novel approach employs synergistic prompt-based pooling. This method pools multiple instances through the weighted sum of learnable-context and slide features, leveraging mean learned-prompt vectors.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The focus of this paper is on employing prompt learning for low-shot learning downstream tasks based on the VL model, which is a highly intriguing approach. Additionally, the problem description and logic in the paper are clear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper lacks innovation, and the descriptions of existing methods are unclear, containing significant errors. In the last paragraph of the introduction, the authors mention their study on “few-shot weakly supervised classification” as “filling the blank” in this area. However, in the penultimate paragraph, I carefully reviewed the brief mention of existing work by the authors: “As opposed to visual-only prompting in PromptMIL [19] and using multiple prompt learning in Qu et al. [19], we introduce a simpler design for text-based prompting in few-shot samples with visual adaptation.” Significant issues are as follows:

    -1) The authors cite both PromptMIL and Qu et al. with the same reference, [19], which likely indicates an error.

    -2) The authors do not provide a detailed description of the existing work or how their own work differs from it. Specifically, what are the significant differences between their work and the works of PromptMIL and Qu et al. [19]? What are the approaches of these previous works? Additionally, using “fill the blank” is inappropriate given the existence of these two studies.

    -3) The innovation is insufficient. I read Qu et al. [19] in detail, and the “context-driven instance pooling” introduced in this paper is very similar to the “prompt guided pooling” in that paper. Both employ “language-driven” pooling; what are the differences and advantages of this paper’s method?

    2. The results lack direct comparisons with the most relevant methods, PromptMIL and Qu et al. [19]. Moreover, the “context-driven instance pooling” proposed in this paper is not compared with the “prompt guided pooling” from Qu et al. [19].

    3. Based on my own experience and existing literature [1, 2], the performance of AbMIL in Fig. 2 should not be as poor as depicted in this paper, which casts doubt on the credibility of the results presented. If these are low-shot results, please specify the exact number of shots.

    4. How are CoOp/CoOp-GCE, Adapter, etc. compared in WSI classification? Are the comparisons in this paper fair?

    [19] Qu, L., Luo, X., Fu, K., Wang, M., Song, Z.: The rise of AI language pathologists: Exploring two-level prompt learning for few-shot weakly-supervised whole slide image classification (2023)

    [1] Li, B., Li, Y., & Eliceiri, K. W. (2021). Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14318-14328).

    [2] Zhang, H., Meng, Y., Zhao, Y., Qiao, Y., Yang, X., Coupland, S. E., & Zheng, Y. (2022). Dtfd-mil: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18802-18812).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please carefully address the weaknesses.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I carefully read the author’s rebuttal, which addressed most of my concerns. Overall, while the novelty of this paper has some overlaps with existing work, it still provides some new insights to the field. I have consequently raised my final score to a weak accept.



Review #3

  • Please describe the contribution of the paper

    While zero or few-shot visual-language models have been successfully adapted to many computer vision tasks, there is limited research on their application in Multiple Instance Learning (MIL) for histopathology. To address this gap, the authors propose a few-shot model that leverages prompt adaptation, weighting instances in a Whole Slide Image (WSI) in relation to each prompt.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem addressed (few-shot Multiple Instance Learning (MIL) for histopathology) is interesting.
    • The results demonstrate a clear improvement in performance compared to the baseline.
    • The experiments are comprehensive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors reference three papers on the topic of prompt-based few-shot learning for Multiple Instance Learning (MIL) in histopathology [17, 19, 26], yet they do not compare their proposed method to these works.
    • The adaptable prompt, which constitutes a significant portion of the paper’s contribution, is not a novel concept, even within the histopathology field [19, 26].
    • Overall, the paper is challenging to read and comprehend.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The description of the method is somewhat challenging to follow. While I assume the method could be reproduced with a careful reading, as it does not seem overly complex, I cannot confirm this with certainty at this stage.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Table 1 is too small. It would be better to split it into two tables, one per dataset, or remove the Accuracy (ACC) metric and only show the Area Under the Curve (AUC). Accuracy could be moved to the supplementary material.
    • The paper could also show some of the learned prompts as natural language words through nearest neighbor methods, or visualize them using t-SNE.
    • Typo: ‘As opposed to visual-only prompting in PromptMIL [19]’ the reference should be [26].
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is well-founded and effectively motivated. The results demonstrate a clear improvement over the proposed baseline, and the ablation study (Table 3) provides justification for the contribution. However, the experimental section lacks comparative baselines and sufficient related work to offer broader context.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors successfully addressed my concerns regarding comparison with previous methods.



Review #4

  • Please describe the contribution of the paper

    This work tackles the efficient adaptation of pathology vision-language models to Multiple Instance Learning (MIL). Building upon the recently popularized prompt learning, the authors propose a novel MIL aggregator based on context-pooling. Concretely, the instance-level attention weights are obtained by softmax similarity with the text embeddings of the K prompts per task. The final method is consolidated by including CLIP-Adapter to refine instance-level vision features, together with textual class-wise prototypes obtained using the K prompts per class.
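    For context, here is a minimal sketch of the CLIP-Adapter-style residual refinement the reviewer refers to. The bottleneck width and residual ratio follow common CLIP-Adapter defaults and are assumptions, not this paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAdapter(nn.Module):
    """CLIP-Adapter-style bottleneck MLP with a residual blend of adapted
    and original (frozen) features. Hyperparameters are assumed defaults."""
    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )
        self.alpha = alpha  # residual ratio: how much adapted signal to mix in

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, D) frozen instance features; only the adapter is trained.
        adapted = self.mlp(feats)
        return F.normalize(self.alpha * adapted + (1 - self.alpha) * feats, dim=-1)
```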

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The problem tackled is relevant given the current rise of pathology vision-language foundation models.
    2. The methodological description is easy to follow and clearly explained, and each method’s component is well-motivated. Methodological contributions are sound.
    3. The experiments carried out are appropriate, and the obtained results are promising, showing the contribution of the proposed methods in the few-shot regime.
    4. The authors show ablation experiments that validate the proposed Context Pooling MIL aggregator, and adding CLIP-Adapter on top of the instance-level visual features.
    5. Tumor localization (instance-level) is evaluated both quantitatively and qualitatively.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Adapter baselines. Some recent adapters for few-shot efficient transfer learning of VLMs, i.e. [a] and [b], are not considered.
    2. Linear Probing Implementation. It is used as a baseline with MaxMIL and AbMIL pooling. Nevertheless, it is not clearly stated how the classifier is implemented. For example: are the class prototypes randomly initialized, or are the zero-shot prototypes used for initialization? Are the trained prototypes projected into an l2-norm space as the authors do in the proposed method (Eq. (8))? Recent literature in [b] suggests that these might be important details, especially in the low-shot data regime.
    3. Ablation studies. Regarding the number K of prompt pairs, it is fixed at 4 for all methods, which is fair. Nevertheless, it would be interesting to show an ablation experiment in this regard for the proposed method and datasets to evaluate its sensitivity.

    [a] Task Residual for Tuning Vision-Language Models (2023), CVPR.
    [b] Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models (2023), CVPR.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors do not share open-access code, nor do they promise to do so upon acceptance. As stated, a few important implementation details are missing for some baselines, and the proposed method has several particularities in trainable prompt construction and context-driven pooling that open code would make easier to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • If gamma scaling in Eq. (5) is the pre-trained temperature scaling parameter and remains fixed, I would suggest using tau as in Eq. (1) for easier readability.
    • The sentence in the introduction “Note that fine-tuning VL models … may damage learned features.” is okay, but it would benefit from concrete bibliographic references (e.g. [c] or [d]). [c] Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution (2022), ICLR. [d] Robust fine-tuning of zero-shot models (2022), CVPR.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is sound, and the problem tackled is relevant, given the rising importance of vision-language models in pathology and the popularity of few-shot prompt learning and adapters in computer vision. Its adaptation to Multiple Instance Learning is not straightforward, and there is little literature on the topic yet. Although I have some concerns regarding particular implementation details of the baselines, reproducibility, and missing ablation studies, I think this work is a valuable contribution to MICCAI.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I have carefully reviewed the rebuttal and the other reviewers’ concerns. The most significant concerns were the limited performance of AbMIL and missing related works. For the first, such results are aligned with the randomly initialized Linear Probing performance in the few-shot regime reported in CLIP. Nevertheless, according to the authors’ rebuttal, I find the way the Linear Probe (e.g., in AbMIL) was implemented strange (it would have been good to discuss this topic with fellow reviewers). The authors train a linear projection weight of D x D and then compute cosine similarity with frozen text embeddings (D x C, with C the number of classes), whereas a more natural LP implementation would be to directly train the classifier weights D x C (initializing this layer with the text embeddings), applied over image features. There is relevant recent literature revisiting Linear Probing in CLIP adaptation that explores the importance of such details for performance [a, b]. Nevertheless, these trends might not necessarily translate to the specific case of MIL. Second, regarding the comparisons with prior works, in my opinion the authors have nicely detailed the differences in the rebuttal, which I suggest they include in the revised manuscript. That said, I feel most concerns have been addressed. Although there are some unclear implementation choices for some baselines, this work presents technical novelties for few-shot prompt learning in MIL, and it is timely research given the current rise of histology vision-language models. The experimental setting will promote further benchmarks for better few-shot adaptation of such models in MIL. Thus, I keep my initial recommendation: Accept. If the paper is accepted, I encourage the authors to open-source the code.

    [a] Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models (2023), CVPR.
    [b] A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models (2024), CVPR.




Author Feedback

Summary: R1(A), R3(WR), R4(R), R5(WA). We thank the reviewers for the insightful feedback and are pleased they find our work novel (R1,R3,R5) and well-motivated and relevant (R4,R5). The main concerns include clarifying method details and discussing related baselines in the literature, which we address below:

(R1,R3,R4) “Linear Probe (LP) details and the impact of hyperparameter K”: To clarify, LP is modeled as a single learnable linear layer (randomly initialized) with dimensions D x D that projects image features (x ∈ R^D) to D prior to l2-normalization for similarity estimation with frozen text prompts. For pooling, both Max and AbMIL were employed on the image features prior to feeding inputs to LP. Indeed, we verified the impact of K prompt pairs (i.e., K = 2, 4, 8, 16) and observed that both Ours and CoOp are relatively insensitive, with marginal performance gains (+1%) in the 16-shot setting.
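A minimal sketch of the linear-probe baseline as described in this rebuttal paragraph is shown below. Max pooling is shown for brevity (the rebuttal states AbMIL attention pooling was also used in its place); the class names, temperature, and all identifiers are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProbeMIL(nn.Module):
    """Hypothetical LP baseline per the rebuttal: a trainable D x D projection
    over pooled image features, scored by cosine similarity against frozen
    text-prompt embeddings. Pooling happens before the projection."""
    def __init__(self, dim: int, text_embeds: torch.Tensor, tau: float = 0.07):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # randomly initialized, D x D
        self.register_buffer("text", F.normalize(text_embeds, dim=-1))  # (C, D) frozen
        self.tau = tau

    def forward(self, inst_feats: torch.Tensor) -> torch.Tensor:
        # inst_feats: (N, D) instance features of one slide.
        pooled = inst_feats.max(dim=0).values       # MaxMIL pooling over instances
        z = F.normalize(self.proj(pooled), dim=-1)  # project, then l2-normalize
        return z @ self.text.t() / self.tau         # (C,) class logits
```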

(R4) “Clarify core contributions and motivation”: Our study aims to address the efficient transferability of vision-language (VL) models in the low-data setting, as opposed to full fine-tuning, which requires large-scale data, especially in noisy-label scenarios for whole slide images (Sec. 1, Paragraphs 1 & 3). Herein, we highlight the importance of synergistic adaptation of both language and visual features without degrading the inherent zero-shot ability of VL models (Sec. 1, P4).

(R4) “Reconsider Eq. (5) & paper organization”: To clarify, Eq. (5) is a standard operation in the VL literature for computing image-text similarity (logits) with softmax, used here for instance re-weighting. In addition, Alg. (1) provides a high-level summary of our training procedure for clarity, and the two “+Ours” rows in Tab. 3 were intended to show the impact of context-pooling. All typos will be checked and corrected accordingly.

(R3,R5) “Differences with prior works and missing comparison”: We agree that our context pooling and that of Qu et al. [19, TOP] could be viewed as similar, but they differ in three core aspects: (i) we do not use ChatGPT to extend class-description knowledge; (ii) TOP employs separate prompt learners for the slide and instance levels (with different templates), whereas we use only a single learner; and (iii) we use a contrastive loss without additional objectives such as TOP’s attention loss. Our approach is more distinct and efficient, as prompt optimization does not require multi-loss balancing between separate models. Moreover, comparison with TOP was difficult because their code was unavailable at submission time, and PromptMIL mainly addresses visual prompting (i.e., prefix prompting), which differs from our study and from recent literature that has shifted to textual adaptation and post-visual prompting.

(R3,R5) “Comparison fairness and the credibility of AbMIL”: In all experiments, the text context vectors (K = 4) and few-shot data splits/hyper-parameters were fixed for all methods, with the default settings of prior works for fair comparison [Sec. 3]. Regarding AbMIL, our evaluation shows it to be a strong upper bound when trained with 100% of the data [Tab. 1, 2], but it was limited in the few-shot regime. To clarify, the results in Fig. 2 are based on 16-shot models, indicating severe overfitting for AbMIL (with a high FPR). Moreover, while including more complex MIL approaches (DTFD-MIL, DSMIL) would be beneficial, we believe the gains from leveraging different approaches can be substantial and would equally require method re-design.

(R5) “Readability of Tab. 1 and interpretation of learned prompts (t-SNE)”: We will update Tab. 1 to report only AUC and will correct the erroneous references as suggested. We agree that visualizing the learned words would be beneficial; however, as reported by CoOp and recent works, learned virtual context vectors do not necessarily map to legible words. This is left to future work, along with additional baselines using different pathology VL models.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The two reviewers who gave an initial score of <4 revised their scores upward to accept, so the consensus is accept. This is a highly relevant topic.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The two reviewers who gave an initial score of <4 revised their scores upward to accept, so the consensus is accept. This is a highly relevant topic.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    NA

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NA


