Abstract

Pre-trained vision-language (V-L) models have demonstrated impressive generalization on various downstream tasks, yet their performance is strongly influenced by the input text prompts. Previous studies (e.g., CoPrompt) have used detailed descriptions generated by LLMs to assist model learning. For example, while a coarse-grained prompt such as “A photo of Debris.” is less informative, a fine-grained description such as “Debris consists of dead cells and matrix fragments.” provides additional context and thus improves model performance. However, existing methods generally lack the sensitivity to capture the subtle semantic differences that are crucial for accurately classifying pathology images. To tackle this challenge, we introduce PathoPrompt, a framework that leverages Cross-Granular Semantic Alignment to refine the model’s ability to capture subtle semantic variations in pathology image classification. Specifically, we introduce token-level fine-grained alignment, allowing the model to capture the subtle differences essential for accurate classification. Further, Cross-Granular Semantic Distillation improves generalization by filtering irrelevant information out of both coarse- and fine-grained prompts. Moreover, PathoPrompt employs a prototype-based cross-modal separation mechanism that promotes distinct class boundaries by separating image and text semantics, enabling more effective multi-modal representation learning. Experiments on five pathology datasets and three task types demonstrate that our method achieves superior performance compared to previous methods.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4278_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HuaRun_PathoPrompt_MICCAI2025,
        author = { Huang, Runlin and Liang, Haohui and Cai, Hongmin and Zhuo, Weipeng and Fan, Wentao and Su, Weifeng},
        title = { { PathoPrompt: Cross-Granular Semantic Alignment for Medical Pathology Vision-Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study proposes PathoPrompt, a cross-granular semantic alignment framework for pathology image classification that aims to improve the sensitivity of vision-language models to subtle semantic differences. The method models intra-class variation in pathology images by introducing a token-level fine-grained alignment mechanism, and combines cross-granular semantic distillation to extract valuable information from coarse- and fine-grained prompts, improving the model’s generalization ability. Furthermore, PathoPrompt introduces a prototype-based cross-modal separation mechanism that forms clear category boundaries in the image and text semantic spaces, promoting more effective multimodal representation learning. Extensive experiments show that the framework significantly outperforms existing methods across multiple pathology datasets and tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    PathoPrompt achieves three key improvements over the original method. First, token-level alignment improves the perception of subtle semantic differences in descriptions and adapts to the unstructured, fine-grained feature distribution of pathology images. Second, the cross-granular semantic distillation strategy effectively removes redundant information, making the model more robust and generalizable when processing complex inputs. Finally, the prototype-driven cross-modal separation mechanism enhances semantic decoupling and inter-class distinction between the image and text modalities, further improving the accuracy and stability of cross-modal classification.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Limited innovation: The main improvement of this method comes from introducing more fine-grained text descriptions, which essentially uses additional information to assist learning rather than fundamentally innovating on model structure or learning mechanism. Its performance gain therefore depends largely on the quality of the external descriptions, and the overall technical novelty of the method is weak.

    Unbalanced module effects and lack of explanation: The experiments show that Prototype-based Dual-modal Separation (PDS) has a significant impact on class generalization, while Granularity-aware Semantic Distillation (GSD) is more beneficial for cross-organ generalization. However, the authors do not analyze or explain this phenomenon in depth, which limits the interpretability and theoretical grounding of the method’s internal mechanisms.

    Missing experimental details and insufficient reproducibility: Much information is missing from the experimental design. For example, the ablation study does not clearly state which dataset was used, nor why that dataset was chosen for verification, which weakens the experiments’ persuasiveness. The paper also lacks an analysis of key hyperparameters, so the method’s stability under different settings cannot be evaluated, which affects the reader’s judgment of its generalization ability and practicality.

    There are some notational errors in the text; for example, p_{text,c} is not defined.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    see weakness above

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents PathoPrompt, a novel framework leveraging cross-granular semantic alignment and prototype-based separation to enhance vision-language models for fine-grained pathology image classification.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work addresses a clinically relevant challenge by improving sensitivity to subtle semantic variations in medical prompts. The methodology is well-motivated and innovative and experiments demonstrate competitive performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The architecture overview (Fig. 2) should explicitly illustrate the inference pipeline, with arrows or annotations depicting the flow of data from input to output predictions.
    2) Why is the PDS term subtracted rather than added in the loss function (Eq. 8)?
    3) Table 3 suggests domain-specific templates improve performance, but the optimal template (“A tissue sample slide of {CLASS}”) is not used in the main experiments. Align the ablation settings with the primary results.
    4) Standardize “Accuracy” as “Acc.” in Table 3 to align with Table 1. Additionally, include F1 scores for all prompt templates to ensure consistency.
    5) Provide additional details on dataset characteristics, including dataset scale, image resolutions, and preprocessing steps (e.g., normalization, augmentation), as well as a detailed description of how the model is trained.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A timely paper resolving an emerging challenge in the application of pretrained VL models to clinical practice. A well-motivated methodology.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I stick to my previous comments



Review #3

  • Please describe the contribution of the paper

    The main contribution of this paper lies in the introduction of PathoPrompt, a novel framework that enhances pathology image classification by leveraging Cross-Granular Semantic Alignment. Specifically, the authors propose a fine-grained, token-level alignment mechanism to improve sensitivity to subtle semantic variations in prompts, which is crucial in the medical imaging domain. Additionally, they introduce a Cross-Granular Semantic Distillation technique to filter out irrelevant information from both coarse and fine-grained prompts, enhancing model generalization.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel Use of Cross-Granular Semantic Alignment. Unlike existing methods that rely solely on either coarse or fine-grained textual prompts, PathoPrompt explicitly aligns semantics across multiple granularity levels (from phrase-level to token-level).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lack of training details. For example, the authors said they used 0.0035 as the learning rate. I am curious did they do any hyperparameter tuning to get this value?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well-organized evaluations and robust method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their feedback and suggestions. We appreciate the recognition, such as: “the methodology is well-motivated and innovative” (Reviewer #1), “Outperforms existing methods” (Reviewer #2), and “paper introduces a novel framework” (Reviewer #3).

@Reviewer #1 Regarding Comment 7.1 (Figure Annotations): Thanks for the suggestion. Owing to layout constraints, some arrows and legends in Figure 2 were not clearly annotated. We will revise the figure for better clarity.

Regarding Comment 7.2 (PDS Loss Explanation): The Prototype-based Dual-modal Separation (PDS) loss is designed to enforce greater inter-class separability in the embedding space. Specifically, the loss tries to push instances of different classes (e.g., Class A vs. Class B) apart. By minimizing this loss, the model learns to maximize semantic distance between class prototypes, thereby improving classification performance.
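For intuition, the separation objective described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration (not the paper's exact PDS formulation): the loss is the mean pairwise cosine similarity between distinct class prototypes, so minimizing it pushes prototypes of different classes apart.

```python
import numpy as np

def pds_loss(prototypes: np.ndarray) -> float:
    """Hypothetical prototype-separation loss.

    prototypes: (num_classes, dim) array of class prototype embeddings.
    Returns the mean cosine similarity over all pairs of *different*
    class prototypes; minimizing it maximizes inter-class separation.
    """
    # L2-normalize each prototype so the dot product is cosine similarity.
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = p @ p.T                         # pairwise cosine similarities
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(off_diag.mean())

# Orthogonal (well-separated) prototypes give a low loss; collapsed
# (identical) prototypes give the maximal loss of 1.
```

Under this toy objective, orthogonal prototypes score 0 while identical prototypes score 1, matching the rebuttal's claim that minimizing the loss maximizes semantic distance between class prototypes.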

Regarding Comment 7.3 (Domain-Specific Templates): We agree that domain-specific templates such as “A tissue sample slide of {CLASS}” can improve accuracy on certain datasets. However, such templates may introduce dataset-specific bias and limit generalizability. Therefore, we adopt the more generic prompt “A photo of a {CLASS}” as the default template in our main experiments, ensuring consistency and generalizability across diverse datasets.

Regarding Comments 7.4 & 7.5 (Dataset and Training Details): Our experiments are conducted across four pathology datasets: Kather (107,180 samples), Colorectal Histology (5,000), PanNuke (7,559), and KIMIA (960), covering both large-scale and low-data scenarios. Training settings adopt 224×224 resolution with standard augmentations (crop, flip, jitter, normalize).

@Reviewer #2 Regarding Comment 7.1 (Limited Innovation): While our method utilizes external prompts, its contributions extend beyond prompt refinement. Compared with CoPrompt—which also uses external descriptions—we incorporate a Cross-Granular Semantic Distillation (GSD) module that filters redundant information between coarse and fine-grained prompts, enhancing the model’s focus on core semantics. Additionally, we introduce a contrastive learning-inspired separation mechanism to promote inter-class distinction in the multimodal embedding space. Thus, the performance gains derive not only from prompt quality but also from novel training strategies and architectural enhancements.

Regarding Comment 7.2 (Interpretability): The PDS module enhances inter-class boundaries by clustering features of the same class and separating those of different classes, similar to contrastive learning. This structure promotes better discriminability in class generalization tasks. The GSD module retains the most informative components across prompt granularities while suppressing domain-specific noise. In cross-organ generalization tasks—where models trained on organ A are evaluated on unseen organs—GSD helps preserve essential semantic features while discarding organ-specific artifacts, leading to improved performance.
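As a rough illustration of the fusion idea behind GSD (a hypothetical sketch, not the paper's actual module), one could build a distillation target by averaging the normalized coarse- and fine-grained prompt embeddings: components on which the two granularities disagree partially cancel, damping granularity-specific noise, while shared semantics survive. A student embedding is then pulled toward this fused target with a cosine-distance loss.

```python
import numpy as np

def gsd_distill_target(coarse: np.ndarray, fine: np.ndarray,
                       tau: float = 0.5) -> np.ndarray:
    """Hypothetical cross-granular target: blend the L2-normalized
    coarse and fine prompt embeddings, then renormalize. Directions
    shared by both granularities are reinforced; conflicting ones
    are attenuated."""
    c = coarse / np.linalg.norm(coarse)
    f = fine / np.linalg.norm(fine)
    t = tau * c + (1.0 - tau) * f
    return t / np.linalg.norm(t)

def distill_loss(student: np.ndarray, target: np.ndarray) -> float:
    """Cosine-distance loss pulling the student embedding toward
    the fused target (0 when perfectly aligned, up to 2 when opposed)."""
    s = student / np.linalg.norm(student)
    return 1.0 - float(s @ target)
```

The `tau` mixing weight and the cosine-distance choice are illustrative assumptions; the point is only that distilling from a fused target, rather than from either granularity alone, is one way to suppress granularity-specific artifacts.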

Regarding Comments 7.3 & 7.4 (Experimental Details): We apologize for the confusion caused by the lack of detailed descriptions due to space limitations. To clarify: Table 3 reports class generalization results on the Kather dataset. Table 4 shows few-shot generalization results on the BloodMNIST dataset and cross-organ generalization on PanNuke. Extended hyperparameter analysis was omitted because of submission limits.

@Reviewer #3 Regarding Comment 7 (Learning Rate Setting): We follow the learning rate setting (0.0035) from prior works such as CoPrompt and MaPLe to ensure fair and consistent comparisons. While this may not be the optimal value for our method, we chose to maintain alignment with prior studies. We agree that further hyperparameter tuning could potentially yield better results, and we will explore this in future work.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Most of the reviewers’ concerns are addressed. The proposed method is indeed intuitive for addressing the granularity for semantic alignment. The authors are encouraged to polish the manuscript to include more explanations on the details for reproducibility.


