Abstract
Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at https://github.com/ubc-tea/PTCMIL.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3145_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/ubc-tea/PTCMIL
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ZhaBei_PTCMIL_MICCAI2025,
author = { Zhao, Beidi and Kim, Sangmook and Chen, Hao and Zhou, Chen and Gao, Zu-hua and Wang, Gang and Li, Xiaoxiao},
title = { { PTCMIL: Multiple Instance Learning via Prompt Token Clustering for Whole Slide Image Analysis } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15974},
month = {September},
pages = {508 -- 518}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces prompt tokens into a ViT, integrates clustering and WSI-level analysis end-to-end, and shows that PTCMIL improves performance while reducing computational overhead. Experiments are conducted on multiple datasets and multiple tasks to demonstrate the performance, and WSI visualizations produced by the model are provided.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The main strengths of this paper are as follows:
- The experimental part of this paper covers a variety of tasks, including classification, survival prediction, and even few-shot learning. The large number of experiments increases the credibility and effectiveness of the method.
- The method obtains tokens via clustering to reduce complexity, which is a feasible solution.
- The paper provides code, which makes it easier for readers to follow the work. This is important to the community.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Although I have listed the main strengths of the paper, there are still areas for improvement, as follows:
- The model architecture diagram does not seem to match the text. The meaning of “Sec 3.1, Sec 3.2, Sec 3.3” at the top of Figure 1 is unclear. Also, forgive me if I did not read carefully enough, but I could not find the exact location of the “orange block” mentioned on page 5.
- The method section needs to be described in more detail so that readers can understand it accurately. For example, the calculation of f_local is not described in detail.
- According to the authors’ experiments, only 5 clusters are needed to achieve the best results. As far as I know, some other works classify patches into more than 5 categories.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
In general, the experiments are sufficient and cover many tasks. However, considering the remaining shortcomings in the method and experimental sections, further improvement is needed, so I do not give a higher score.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The methodology presented in this paper is interesting, and the authors’ response answered most of my questions. Although some details remain imperfect, overall the paper meets the MICCAI acceptance criteria.
Review #2
- Please describe the contribution of the paper
- Proposes PTCMIL, a novel ViT-based MIL framework with prompt-guided soft clustering.
- Introduces an end-to-end strategy that jointly learns clustering and downstream tasks.
- Designs a prototype aggregation mechanism using local Transformer and learnable weights.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The method is conceptually clean and integrates prompt learning with MIL effectively.
- Achieves strong results across classification and survival prediction tasks on 8 datasets.
- Provides visualizations and ablation studies supporting interpretability and robustness.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Although the paper refers to the method as “clustering,” it essentially performs a soft assignment via learnable prompt tokens (projection + weighting), rather than traditional clustering. In Fig. 2, the clustering results of PANTHER and PTCMIL differ significantly. However, the paper lacks qualitative evaluation, making it difficult to assess whether the prompt-based clustering achieves better semantic consistency than traditional clustering or MIL approaches.
- While the method claims improved efficiency, the paper does not report memory usage, training time, or model size.
- Table 4 includes ablation studies on pooling and merging, but does not explore the impact of the number of prompt tokens (i.e., cluster number C) or the effectiveness of the regularization loss in preventing prompt collapse.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a promising prompt-guided MIL framework with strong results across multiple datasets. However, it lacks qualitative evaluation of the clustering quality and key experimental details such as efficiency and ablation on prompt-related components.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal effectively addresses my concerns on clustering validity and efficiency, shifting my evaluation toward acceptance.
Review #3
- Please describe the contribution of the paper
This paper introduces a novel Prompt Token Clustering-based ViT. They evaluate the proposed method across eight datasets in multiple downstream tasks, achieving SOTA performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The writing and overall presentation of the paper are excellent; the paper is well-organized and clearly written.
- The proposed framework is thoughtfully designed, with each component specifically addressing the challenges inherent in the task. Notably, the use of a learnable prompt token-based clustering method and prototype merging over clusters are well-justified and directly tackle key issues.
- Each element within the framework is reasonable and contributes to the overall effectiveness of the solution.
- The experimental validation is excellent. The authors compare the proposed method with many well-known and SOTA models.
- Additionally, by extracting features using both CTransPath and UNI, the experiments demonstrate that the proposed method consistently outperforms these alternatives across different feature extraction approaches.
- The usage of a diverse dataset that encompasses multiple cancer types and targets various downstream tasks further underscores the versatility and robustness of the method.
- The idea of visualization is good; Fig. 2 improves the interpretability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The computational cost is not reported. This is my only concern, since the proposed method is trained end-to-end with a ViT.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
See strengths. Very good experiments and visualizations.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank all three reviewers for their constructive feedback and encouraging remarks. Below is a summary of the recognized strengths:
- Strong multi-task experiments (R1, R2, R3)
- SOTA results on diverse datasets (R2, R3)
- Clean writing and organization (R2, R3)
- Reproducibility (R1)
- Effective visualizations and ablations (R2, R3)
R1 – Clarification Focused Q1 (Fig 1): Thank you for your careful reading. Fig. 1 was not updated to reflect Sec 3.1–3.3. We sincerely apologize for this oversight. We will revise the figure and correct all references. The “orange block” refers to “Token Merging” (p.5), with the corresponding formula immediately below.
Q2 (ƒ_local): We use a standard Transformer layer as f_local, which follows prior ViT-based MIL works [17, 23]. The “local” refers to the layer applied to the patches in each cluster. We will clarify this in our revision.
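The per-cluster role of f_local and the subsequent prototype pooling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `f_local` here is a bare single-head self-attention (learned projection weights omitted), and `prototype_pool`, `softmax`, and mean pooling are illustrative stand-ins for the paper's Transformer layer and prototype-based pooling.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def f_local(X):
    """Stand-in for the f_local Transformer layer: single-head
    self-attention over the patches of one cluster (weights omitted)."""
    att = softmax(X @ X.T / np.sqrt(X.shape[1]))  # (n, n) attention
    return att @ X                                # (n, d) attended patches

def prototype_pool(patches, assignments, num_clusters):
    """Apply f_local within each cluster, then mean-pool into one
    prototype per cluster; stacked prototypes form the WSI representation."""
    d = patches.shape[1]
    protos = []
    for c in range(num_clusters):
        Xc = patches[assignments == c]
        protos.append(f_local(Xc).mean(axis=0) if len(Xc) else np.zeros(d))
    return np.stack(protos)  # (C, d)

rng = np.random.default_rng(0)
Z = prototype_pool(rng.normal(size=(60, 8)), rng.integers(0, 4, size=60), 4)
print(Z.shape)  # (4, 8)
```

Because attention is computed only within each cluster, the quadratic cost is over cluster sizes rather than the full bag of N patches.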
Q3 (Cluster number): We noticed that different previous works [19,28] used fewer or more cluster numbers even for the same datasets. Empirically, our model performs robustly within a reasonable range of cluster numbers (e.g., Fig. 3). Within this range, fewer clusters reduce computational cost and indicate the power of PTCMIL to learn the heterogeneous patterns efficiently and effectively.
R2 – Technical Clarification and Evidence Q1 (Clustering definition & qualitative analysis): (1) We use ‘clustering’ in the broad sense of dividing data into two or more groups without explicit group labels. Our projection + assignment method shares similarities with many classical projection-based clustering methods (e.g., Principal Direction Divisive Partitioning (PDDP), spherical k-means, Self-Organizing Maps (SOM)), but differs by learning the projection vectors end-to-end through a ViT. We also clarify that we perform a hard assignment via a_i = arg max_c A_{i,c} (Sec. 2.2), so each patch is assigned to exactly one cluster. This differs from soft attention, where all instances contribute to all clusters. (2) Fig. 2 shows a qualitative comparison with PANTHER. PANTHER exhibits clustering collapse (homogeneous colors, poor tissue separation), while PTCMIL produces more structured maps that better reflect local heterogeneity. We will expand the explanation around Fig. 2 to emphasize this.
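The projection-plus-hard-assignment step described above can be sketched in a few lines. This is a hedged illustration, not the authors' code: `prompt_token_assign` is a hypothetical name, and the prompt tokens stand in for the learnable parameters that PTCMIL trains end-to-end inside the ViT.

```python
import numpy as np

def prompt_token_assign(patches, prompts):
    """Hard cluster assignment by projecting patches onto learnable
    prompt tokens (illustrative sketch of the idea in Sec. 2.2).
    patches: (N, d) embeddings; prompts: (C, d) prompt tokens."""
    A = patches @ prompts.T   # (N, C) projection-based assignment scores
    a = A.argmax(axis=1)      # a_i = argmax_c A[i, c]: one cluster per patch
    return A, a

rng = np.random.default_rng(0)
A, a = prompt_token_assign(rng.normal(size=(100, 16)), rng.normal(size=(5, 16)))
print(A.shape, a.shape)  # (100, 5) (100,)
```

Unlike soft attention pooling, the argmax routes each patch to exactly one cluster, which is what keeps the downstream per-cluster computation cheap.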
Q2 (Efficiency claim): We appreciate the opportunity to highlight our clustering strategy’s efficiency. Let’s denote the number of patches per WSI as N, their embedding dimension as d, and the number of clusters as C. The computational complexities of our and other clustering methods are:
- PTCMIL: O(CNd)
- K-medoids (example of a traditional clustering method): O(CN^2)
- K-means (used in CLAM [14]): O(ICNd), with I iterations
- GMM (used in PANTHER [19]): O(ICN_all d^2), where N_all refers to patches from all WSIs. In WSIs, N is typically large (e.g., 10K), and with a large sample size we have N_all ≫ N ≫ d ≫ C. On our hardware, the alternative methods may run OOM without patch downsampling, which our method avoids. We will explicitly define this scope in the revision. Additionally, PTCMIL is parameter-efficient: it adds only ~10K parameters to a 2-layer ViT (1.58M → 1.59M) while yielding accuracy gains of up to +0.98%, reflecting a favorable performance–efficiency tradeoff.
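The asymptotic comparison above can be made concrete with back-of-envelope operation counts. The constants and the iteration count I = 10 below are illustrative assumptions, not figures from the paper; the point is only the relative ordering at WSI scale.

```python
def op_count(method, N, d, C, I=10, N_all=None):
    """Rough, constant-free operation counts for the clustering methods
    compared in the rebuttal (illustrative only)."""
    N_all = N if N_all is None else N_all
    return {
        "PTCMIL":    C * N * d,              # one projection pass, O(CNd)
        "K-medoids": C * N ** 2,             # pairwise distances, O(CN^2)
        "K-means":   I * C * N * d,          # I Lloyd iterations, O(ICNd)
        "GMM":       I * C * N_all * d ** 2, # covariance updates, O(IC N_all d^2)
    }[method]

# Assumed WSI scale: N ~ 10K patches, d = 768 features, C = 5 clusters.
N, d, C = 10_000, 768, 5
for m in ("PTCMIL", "K-medoids", "K-means", "GMM"):
    print(f"{m:10s} {op_count(m, N, d, C):.2e}")
```

Even with a single WSI (N_all = N), the GMM and k-medoids counts dwarf the single projection pass, consistent with the OOM behavior reported for the alternatives.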
Q3 (Hyperparameters): (1) Fig. 3 explores the effect of varying cluster numbers. (2) We reported key ablations for conciseness. As for α, we also varied it within a range (e.g., AUC and Acc on TCGA-NSCLC with CTransPath features):
- α=0: 97.10%, 91.70%;
- α=0.1: 97.17%, 92.18%;
- α=0.2: 97.18%, 92.18% (best);
- α=0.3: 97.13%, 91.64%. These results confirm that the regularization term improves prompt stability and performance.
R3 – Accepted, Mild Concern: We thank R3 for the positive feedback on our method, experiments, and visualizations. Regarding the comment on efficiency, please refer to our response to R2 Q2 for the complexity justification.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers praised the strong experiments and performance but initially flagged figure inconsistencies, unclear method details, and missing efficiency metrics. In the post-rebuttal comments, reviewers confirmed that their concerns were addressed and endorsed acceptance. Since most substantive issues have been resolved, I recommend accepting the paper. The authors are encouraged to include these clarifications in the final version of the manuscript.