Abstract
Automated pathological image classification remains a critical challenge, particularly due to the scarcity of annotated data and the complexity of disease-specific features. Existing methods, such as CLIP-based prompt tuning, struggle in few-shot settings and integrate multimodal information poorly in medical contexts. In this study, we introduce PATE (Prompt-based Adaptation for Text-Image Embedding), a novel framework to enhance CLIP’s adaptability for few-shot pathological image classification. Our approach incorporates deep learnable prompts in both the vision and language encoders, enabling effective use of visual and textual information. We also propose a dynamic bridging function for bidirectional information exchange and a Gaussian-weighted Prompt Integration (GPI) strategy that adjusts prompt contributions across epochs, enhancing generalization and reducing overfitting. Extensive experiments on the PatchGastric dataset, which includes 179,285 histopathological patches across three gastric adenocarcinoma subtypes, demonstrate that PATE consistently outperforms state-of-the-art methods, achieving superior performance in both low-data and full-data settings. Ablation studies validate the effectiveness of each component, marking a significant advancement in few-shot medical image analysis, particularly for rare disease diagnosis and digital pathology workflows.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2503_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{CheShe_PATE_MICCAI2025,
author = { Chen, Shenghao and Huang, Zhen and Zhou, Xiaoqian and Li, Han and Wang, Chunjiang and Zhou, S. Kevin},
title = { { PATE: Enhancing Few-Shot Pathological Image Classification via Prompt-Based Text-Image Embedding Adaptation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {499--509}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose an advanced architecture to tune CLIP models to pathology WSI classification tasks. They show improvements in performance over related work in a three-class classification task.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed architectural advances appear reasonable: The authors advance upon CITE, which enriches CLIP by learnable image prompts, mainly by also introducing learnable language prompts as well as alignment between language- and image prompts. Their approach, PATE, outperforms CITE in a small-sized evaluation study on a single dataset.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Though the high-level idea is intuitive, the formal notation of the proposed approach appears flawed, so it remains unclear how precisely the ideas are captured. In particular:
- The language prompt injection mechanism remains unclear: it seems a learnable prompt is injected into the first J transformer layers, but Eq. (2) suggests the prompt (with identical notation) is also an output of the transformer block. Relatedly, Eqs. (2) and (3) are identical, although the text reads as if learnable prompt injection stops at layer J. Furthermore, it remains unclear how b relates to J. Please clarify.
- The vision prompting mechanism is similarly unclear. Please find a notation that distinguishes learnable injected prompts from prompt-token outputs. Also, it seems the index i should be i-1 in Eq. (6), and likewise j should be j-1 in Eq. (7), no?
- Likewise, as written, Eq. (8) could be condensed into merely stating \tilde{P}_i = F_{i-1}(P_{i-1}).
- The description of Gaussian-weighted prompt integration exhibits similar notational unclarity.
Besides the lack of formal rigor, the paper neither discusses nor benchmarks against important related work on vision-only foundation models for pathology image analysis, see e.g. [1]. Also, the dataset used for evaluation is very small compared to community benchmarking efforts on WSI data (as, e.g., in [1]). The work could be made much stronger by comparing PATE against foundation models in suitable scenarios from [1] and other recent benchmarking efforts.
[1] Neidlinger, P., El Nahhas, O. S., Muti, H. S., Lenz, T., Hoffmeister, M., Brenner, H., … & Kather, J. N. (2024). Benchmarking foundation models as feature extractors for weakly-supervised computational pathology. arXiv preprint arXiv:2408.15823.
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Flawed notation
- Limited evaluation given larger benchmarking efforts in the community
- No discussion of, nor benchmarking against, related work on pure vision foundation models with fine-tuning
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
In their rebuttal, the authors clarified the formal notation and considerably extended their benchmarking.
Review #2
- Please describe the contribution of the paper
The paper proposes PATE (Prompt-based Adaptation for Text-Image Embedding), a novel framework for few-shot pathological image classification. Its key contributions include:
- Hierarchical Multimodal Prompting Strategy (HMPS): introduces deep learnable prompts in both the vision and language encoders to enhance cross-modal feature alignment.
- Bridging Function (BF): facilitates dynamic interaction between visual and textual prompts to improve vision-language representation alignment and classification accuracy.
- Gaussian-weighted Prompt Integration (GPI): dynamically adjusts prompt contributions across training epochs using Gaussian weights to enhance few-shot learning and cross-domain adaptation in medical imaging tasks.
Experiments on the PatchGastric dataset show PATE outperforms baselines (e.g., 59.3% vs. 39.1% accuracy in the 1-shot setting).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novelty: First to jointly optimize vision and text prompts in medical image classification. The bidirectional bridging mechanism addresses limitations of previous methods (e.g., CITE).
- Technical Depth: The Gaussian-weighted Prompt Integration (GPI) strategy enhances robustness by adaptively prioritizing task-relevant prompts from intermediate training epochs using Gaussian-distributed weights.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited Generalizability: Validation is restricted to a single dataset (PatchGastric). Cross-dataset testing (e.g., on TCGA) is missing.
- Technical Ambiguity: Insufficient details on the Learnable Context’s implementation, which may hinder reproducibility.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces PATE, a novel framework that advances few-shot pathological image classification through multimodal prompting and Gaussian-weighted prompt integration (GPI), achieving significant accuracy gains. Its methodological innovation and rigorous validation on PatchGastric (ablation studies, few-/full-data scenarios) justify acceptance. Limitations include dataset specificity and technical ambiguity, but its technical novelty aligns with MICCAI’s focus on scalable medical AI. A rebuttal addressing these points could strengthen the impact.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes a novel framework that enables adaptation of CLIP-like models (which learn via natural-language supervision) to pathology image classification. This is useful because of the scarcity of annotated data and the visual complexity of disease-specific features. The paper provides a useful methodology and interesting insights that may enable the development of similar frameworks for other medical imaging application areas, all of which suffer from scarcely annotated data.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Below are some of the major strengths of this paper:
- This paper is structured and written well. It also provides a good review of the current state of the art in multimodal learning in the context of pathology.
- A number of novel ideas, such as the bridging function (BF) and Gaussian-weighted prompt integration (GPI), are investigated to enhance learning between the text and image modalities. The performance gain introduced by each block is demonstrated via an ablation study.
- Results demonstrate that the proposed approach outperforms SOTA (natural-language-supervised) models, although a performance comparison with conventional (uni-modal) image classification models is missing.
- Graphical illustrations are effective and clearly convey the meaning of the text.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Below are some minor weaknesses of this paper.
- Performance is not compared with conventional image-only models. I understand that CLIP models outperform those models in natural image classification tasks, but I am not sure whether the same can be claimed for medical applications where data is limited.
- In Section 2.2 it is not very clear how the vision prompts are initialized. The last line of Section 3 states: “The text prompts follow the template ‘a photo of a [CLASS],’ while all other model parameters are randomly initialized from a normal distribution.” Does this mean that the vision prompts are randomly initialized? If so, I recommend adding a line indicating this in Section 2.2 (vision prompting sub-section).
- Equation (8) does not seem very clear. The text preceding it says that the language prompt is concatenated with the previous vision embeddings, but it is not clear where or how the concatenation takes place. I suggest rewording for clarity; I found Equations (5) and (8) somewhat confusing.
- I am somewhat sceptical about adding tunable prompts to both the vision and language encoders. Is there any impact of this in terms of overfitting? Perhaps the authors can address this in an extension of their work.
- First line of page 7: “The length of both language and vision prompts is fixed at 2.” This sentence seems somewhat vague. I suggest rewording for clarity.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors have tackled an interesting problem and provided a coherent solution. Performance is compared against established baselines of similar models. The experimental details are clear and based on an open-source dataset. The proposed approach has the potential to be extended to other applications in medical imaging as well.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
First, we sincerely thank the reviewers for their constructive feedback and the Area chair for their time and effort in handling our submission!
R1-Q1: Lack of image-only model comparisons. Thank you for your positive feedback and thoughtful suggestions. We have added comparisons with image-only models ViT, DinoSSLPath, and Virchow2 on PatchGastric (16-shot). PATE achieves 71.2% accuracy, compared to ViT (62.5%), DinoSSLPath (64.1%), and Virchow2 (65.3%). These results validate the advantage of our multimodal prompting framework in few-shot scenarios.
R1-Q2: Vision prompt initialization. Vision prompts are initialized using a truncated normal distribution (mean=0, std=0.02), consistent with CLIP. This ensures stable convergence without introducing prior semantic bias. We will clarify this in Section 2.2.
R1-Q3: Eqns (5)/(8) and concatenation confusion. We will revise the notation to clearly indicate that prompts are concatenated with patch embeddings along the sequence dimension before the transformer input. A schematic in Fig. 2 and additional comments will aid clarity.
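For concreteness, a minimal sketch of this sequence-dimension concatenation is given below; the module name, shapes, and embedding width are illustrative assumptions rather than the authors' released code, while the truncated-normal initialization follows the description in the preceding answer.

```python
import torch
import torch.nn as nn

class VisionPromptInjector(nn.Module):
    """Prepares the input of one ViT block by appending learnable prompt tokens."""

    def __init__(self, num_prompts: int = 2, dim: int = 768):
        super().__init__()
        # Truncated-normal init (mean=0, std=0.02), as stated in the rebuttal; shapes assumed.
        self.prompts = nn.Parameter(torch.empty(num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, mean=0.0, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) -> returns (B, N + num_prompts, dim)
        batch = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([patch_tokens, prompts], dim=1)  # concatenate along the sequence dimension
```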
R1-Q4: Overfitting concern. Our GPI strategy mitigates overfitting by emphasizing mid-epoch prompts. For instance, removing GPI leads to a 1.5% drop in 1-shot accuracy. We will clarify this in the revision and explore additional regularization (e.g., dropout) in future work.
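As an illustration of how Gaussian-weighted Prompt Integration can emphasize mid-epoch prompts, a minimal sketch follows; the per-epoch snapshot granularity, the choice of mean and standard deviation, and the function names are assumptions for exposition, not the authors' implementation.

```python
import torch

def gaussian_epoch_weights(num_epochs: int, sigma=None) -> torch.Tensor:
    """Normalized weights over epochs, peaked at the middle epoch (sigma is an assumed hyper-parameter)."""
    t = torch.arange(num_epochs, dtype=torch.float32)
    mu = (num_epochs - 1) / 2.0                      # centre the Gaussian on mid-training
    sigma = sigma if sigma is not None else num_epochs / 4.0
    w = torch.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return w / w.sum()

def integrate_prompt_snapshots(snapshots) -> torch.Tensor:
    """Gaussian-weighted average of per-epoch prompt tensors, each of shape (b, d)."""
    w = gaussian_epoch_weights(len(snapshots))
    stacked = torch.stack(snapshots, dim=0)          # (E, b, d)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)   # (b, d)
```

In this reading, the snapshots would be detached copies of the prompt parameters saved at the end of each training epoch.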
R1-Q5: Prompt length wording. Thanks for pointing this out! We reworded it to: “We use 2 learnable prompt tokens for both modalities at each injected layer.”
R2-Q1: Language prompt mechanism unclear. We appreciate your careful review and critical insights. (1) Eq. (2) describes the update of tokens and prompts across transformer layers; the prompt is not the output but is passed through. (2) Eq. (3) applies post-injection (i > J). We’ll revise the notation to distinguish this. (3) b is the number of prompt tokens; J is the injection depth. This will be clarified.
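A minimal sketch of this clarified injection scheme is shown below, assuming b prompt tokens are prepended to the text tokens and re-injected only in the first J layers; the layer interfaces and names are illustrative, not the paper's notation or code.

```python
import torch

def encode_text_with_deep_prompts(layers, word_tokens, layer_prompts, J):
    """
    layers:        list of text-transformer layers, each mapping (L, d) -> (L, d)
    word_tokens:   embedded text tokens, shape (T, d)
    layer_prompts: list of J learnable prompt tensors, each of shape (b, d)
    J:             injection depth (prompts are re-injected only in the first J layers)
    """
    b = layer_prompts[0].size(0)
    x = torch.cat([layer_prompts[0], word_tokens], dim=0)   # input to the first layer
    for i, layer in enumerate(layers):
        x = layer(x)
        if i + 1 < J:
            # Layers 2..J: overwrite the prompt slots with the next layer's learnable
            # prompts (injection); the prompt outputs of this layer are discarded.
            x = torch.cat([layer_prompts[i + 1], x[b:]], dim=0)
        # Layers beyond J: prompt outputs are simply carried forward, no re-injection.
    return x
```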
R2-Q2: Vision prompting is unclear. (1) Prompt tokens are injected inputs, not outputs. (2) Thank you — indices in Eq. (6)/(7) should be i–1/j–1. We have fixed these.
R2-Q3: Eq. (8) simplification. Thanks for pointing this out. We will simplify Eq. (8) to P̃ᵢ = Fᵢ₋₁(Pᵢ₋₁) and revise the description for brevity and clarity.
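Under the simplified form P̃ᵢ = Fᵢ₋₁(Pᵢ₋₁), one minimal reading of the bridging function is a per-layer linear map from the text-prompt space to the vision-prompt space; the linear parameterization and the dimensions below are assumptions for illustration only.

```python
import torch.nn as nn

class BridgingFunction(nn.Module):
    """Per-layer map F_{i-1} sending text prompts P_{i-1} to vision prompts P~_i (linear form assumed)."""

    def __init__(self, text_dim: int = 512, vision_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(text_dim, vision_dim)

    def forward(self, text_prompts):      # text_prompts: (b, text_dim)
        return self.proj(text_prompts)    # vision prompts for the next layer: (b, vision_dim)
```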
R2-Q4: Gaussian-weighted prompt integration notation is unclear. Thanks for pointing this out. We will simplify the explanation and clarify the notation in Section 2.3. A figure will also be added to improve readability.
R2-Q5: Missing vision-only foundations for pathology. We now compare our method (PATE 71.2%) against ViT (62.5%), DinoSSLPath (64.1%), and Virchow2 (65.3%) on PatchGastric. PATE shows consistent improvements due to multimodal synergy. Notably, Virchow2 is not a pure vision-only model, so our primary image-only baselines are ViT and DinoSSLPath.
R2-Q6: Dataset extension. Thanks, we have extended our experiments, and the results support the effectiveness of our method. We will add evaluation results based on a larger cohort, TCGA-STAD (443 cases), to strengthen the experimental evidence.
R3-Q1: Cross-dataset evaluation. Thank you for your positive feedback and thoughtful suggestions. We have added a weakly-supervised cross-dataset evaluation using TCGA-STAD (443 cases) as an external test set. Preliminary results indicate that PATE maintains its performance advantages, reinforcing its generalizability.
R3-Q2: Details on Learnable Context. Learnable context tokens are initialized with natural-language descriptions (e.g., “a photo of a poorly differentiated adenocarcinoma”) and optimized jointly with the transformer. They are injected layer-wise into the text encoder. Section 2.2 has been expanded to include this.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers agreed to accept the manuscript, and the rebuttal addressed most reviewers’ doubts.