Abstract

Digital pathology images are not only crucial for diagnosing cancer but also play a significant role in treatment planning and research into disease mechanisms. The multiple instance learning (MIL) technique provides an effective weakly-supervised methodology for analyzing gigapixel Whole Slide Images (WSIs). Recent advancements in MIL approaches have predominantly focused on predicting a single diagnostic label for each WSI, while enhancing interpretability via attention mechanisms. However, given the heterogeneity of tumors, each WSI may contain multiple histotypes. Moreover, the generated attention maps often fail to offer a comprehensible explanation of the underlying reasoning process. These constraints limit the applicability of MIL-based methods in clinical settings. In this paper, we propose a Prototype Attention-based Multiple Instance Learning (PAMIL) method, designed to improve the model's reasoning interpretability without compromising its classification performance at the WSI level. PAMIL merges prototype learning with attention mechanisms, enabling the model to quantify the similarity between prototypes and instances, thereby providing interpretability at the instance level. Specifically, PAMIL is equipped with two branches that produce prototype-level and instance-level attention scores, which are aggregated to derive bag-level predictions. Extensive experiments are conducted on four datasets spanning two diverse WSI classification tasks, demonstrating the effectiveness and interpretability of our PAMIL.
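To make the two-branch design concrete, below is a minimal PyTorch sketch of a prototype-attention MIL forward pass, assembled only from the description above. The class name PAMILSketch, the layer sizes, and the exact aggregation rules are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAMILSketch(nn.Module):
    """Minimal two-branch prototype-attention MIL (illustrative only)."""

    def __init__(self, feat_dim=512, embed_dim=128, n_proto=8, n_class=2):
        super().__init__()
        self.g = nn.Linear(feat_dim, embed_dim)  # shared dimensionality reduction
        self.prototypes = nn.Parameter(torch.randn(n_proto, feat_dim))  # learnable prototypes
        self.attn = nn.Sequential(nn.Linear(embed_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.cls_inst = nn.Linear(embed_dim, n_class)
        self.cls_proto = nn.Linear(embed_dim, n_class)

    def forward(self, bag):                     # bag: (n_inst, feat_dim) patch features
        H = self.g(bag)                         # instance embeddings (n_inst, embed_dim)
        T = self.g(self.prototypes)             # prototype embeddings (n_proto, embed_dim)
        S = F.softmax(H @ T.t() / H.shape[1] ** 0.5, dim=1)  # instance-prototype similarity
        # Prototype branch: each prototype aggregates the instances most similar to it.
        proto_feat = S.t() @ H                  # (n_proto, embed_dim)
        # Instance branch: attention over instances yields a bag-level embedding.
        A = F.softmax(self.attn(H), dim=0)      # (n_inst, 1)
        inst_feat = (A * H).sum(0)              # (embed_dim,)
        y_inst = self.cls_inst(inst_feat)
        y_proto = self.cls_proto(proto_feat.mean(0))
        return y_inst, y_proto, A, S            # A and S support instance-level interpretation
```

Per the author feedback below, the final bag prediction would then be the average of the two branches' probabilities.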

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1022_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/Jiashuai-Liu/PAMIL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Liu_PAMIL_MICCAI2024,
        author = { Liu, Jiashuai and Mao, Anyu and Niu, Yi and Zhang, Xianli and Gong, Tieliang and Li, Chen and Gao, Zeyu},
        title = { { PAMIL: Prototype Attention-based Multiple Instance Learning for Whole Slide Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a prototype attention-based MIL for WSI classification. This method employs prototype learning to tackle the challenges associated with multi-tumor labels and the limitations of the attention mechanism in terms of interpretability. One of the advantages of this method is that it offers superior interpretability compared to the attention mechanism. While the latter merely highlights the significant regions, it does not provide the underlying reasoning procedures. In contrast, this work not only identifies important regions but also elucidates the rationale behind their selection. The efficacy of the method is shown by the experimental results. Furthermore, the ablation studies validate each component of this module.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors present a prototype attention-based MIL method in this paper. This approach notably enhances interpretability, a significant improvement over the traditional use of the attention mechanism alone.
    2. The experimental results provided are compelling, demonstrating that the proposed method has achieved state-of-the-art results across almost all multi-label classification tasks.
    3. The paper is well-structured and clearly written, making it easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The results presented in the paper do not convincingly demonstrate the effectiveness of the proposed method. While the method appears to perform well in multi-label scenarios, it does not achieve state-of-the-art results in single-label scenarios.
    2. The methodology section describes a two-step clustering process, where instances are initially clustered into 10 centroids, which are then further clustered into 8 centroids. This approach raises questions about its necessity and effectiveness. Clustering a small number of centroids (10) into an even smaller number (8) could potentially introduce randomness and may not yield significant benefits. Could you please provide further clarification or justification for this two-step clustering process?
    3. The paper differentiates between the instance branch and the prototype branch, but it is not entirely clear why these branches are treated differently. Given that both branches ultimately produce scores for instances or prototypes, it seems plausible that they could be treated similarly. Could you please elaborate on the specific reasons for this differentiation? Understanding the rationale behind this decision could provide valuable insights into your methodology.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. To enhance readability and understanding, I recommend denoting matrices and vectors using bold fonts and clearly indicating their dimensions. This would provide a more straightforward visual cue for readers and help them follow your mathematical formulations more easily.
    2. The nature of Y_proto and Y_inst is not entirely clear in the current version of the paper. Are they intended to represent integer labels, probabilities, or logits? If they are logits, the use of KL divergence might not be appropriate. I suggest providing a clear explanation of these variables to avoid any potential confusion.
    3. The naming and explanation of the instance/prototype branches could be clarified. In the instance branch, the calculation is denoted as A * T, where T is the embedding for prototypes. Conversely, in the prototype branch, the calculation is S * H, where H is the instance embedding. This seems somewhat counterintuitive, and I may have misunderstood something. Could you please provide further clarification on this matter?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. While the paper shows promising results in the multi-label scenario, it does not achieve state-of-the-art performance in the single-label scenario. In fact, it is outperformed by previous methods. It would be beneficial to delve deeper into this discrepancy and perhaps explore ways to improve the performance in single-label scenarios.

    2. Although the paper is generally easy to understand, it lacks a clear motivation for the architectural design, as mentioned in the weaknesses section. Providing a more detailed rationale for your design choices would not only strengthen the paper but also make it more accessible to readers.

    3. There are certain aspects of the paper that are vague and counterintuitive. I recommend revisiting these sections to ensure clarity and logical coherence. Providing additional explanations or examples could be helpful in this regard.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have mainly addressed my concerns.

    However, I still maintain my opinion on the two-step clustering. Although the authors have explained that two-step clustering is more effective than one-step clustering, it does not make sense to cluster 10 centroids into 8 centroids. To me, a more sensible approach is to cluster something like 500 centroids into 8. I hope the authors will also consider this point when drafting their final manuscript if the paper is accepted.



Review #2

  • Please describe the contribution of the paper

    The authors propose PAMIL, an interpretable method for MIL classification of WSIs. They merge attention mechanisms with prototype learning, thereby achieving the ability to measure the similarity between prototypes and instances, providing interpretability at the instance level. By utilizing two branches, one for prototypes and another for instances, they did not compromise performance. They obtained a consistent improvement compared to the state of the art and provided visualizations of the interpretability of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interest: The method is interesting as it addresses a hot research topic in digital pathology, the interpretability of MIL methods. It further enhances the advantages of ab-MIL and protoMIL by incorporating two branches.
    • Results: The results are consistent, and the comparison includes the most popular methods for MIL in WSIs. Furthermore, the ablation study is comprehensive, and the authors utilize several datasets covering various tasks (multi-label and multi-class).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Lack of comparison in visualization/interpretability: While the performance is compared to other methods, the visualization and interpretability are not compared. Since interpretability is a core aspect of the method, it should be further assessed.
    • The method should be further explained: It is difficult to fit all the details into an 8-page paper, but some details make it difficult to understand the method fully. For example, what is the KL divergence between Y_proto and Y_inst? There are no distributions specified for these variables. I assume that it is a categorical distribution, but is this appropriate for multi-label classification? The method section would benefit from fully specifying the dimensions of the parameters, weights, and variables. I believe that there are some mistakes, for example, “Both P and V are then processed by a dimensionality reduction layer g() to get compressed embeddings T for prototypes and H for instances.” It should be the reverse, right?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The method is interesting and has great application to WSIs. My main concern is with the mathematical notation and method details; I need more details.

    1. I understand that the prototypes are initialized with k-means and then optimized. How are the prototypes associated with categories and the actual instances of the dataset? Can the authors provide further details on the function g(·)?
    2. Does w_c represent a weight per class? What is the significance of the sub-index? Can the authors clarify the dimensions of the weights and functions in the framework?
    3. Which distribution is assumed for Y? How is the KL divergence computed?
    4. I recommend for future work to compare the visualizations with AB-MIL or protoMIL. Do the authors believe that this dual-stream framework improves interpretability or only performance?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is interesting and the results are convincing. However, the method description lacks details and some notation is misleading.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I would like to acknowledge the effort of the authors on the manuscript and in the rebuttal. The idea behind the paper is very interesting and promising for future systems. The interpretability provided by the two branches can enhance the assessment and fairness of the model when deployed in practical scenarios. The authors’ rebuttal has addressed most of my comments. However, I encourage the authors to revise the notation and method description.



Review #3

  • Please describe the contribution of the paper

    This paper introduces a multiple instance learning approach that merges prototype learning with attention mechanisms to quantify the similarity between prototypes and instances, thereby providing interpretability at the instance level for whole slide image (WSI) analysis. Prototype and instance branches are proposed to obtain the representative WSI features for classifying the WSI with only slide-level label provided.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper’s strength lies in its innovative use of prototype and instance-level attention scores within a two-branch multiple instance learning framework, effectively addressing the need in computational pathology for robust feature representation learning and classification despite weak labels.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The application of this model to multi-class WSIs with extensive non-cancerous regions remains challenging and less convincing, as it may lead to misinterpretations or biased classifications due to the dominance of non-tumor features.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    There are some confusing parts that need further clarification.

    1. Since the model has two branches of predictions, how is the final prediction determined?
    2. The design of this model may lead to a focus on non-cancerous features if the dataset contains numerous slides with extensive non-tumor regions.
    3. Is the pre-trained encoder frozen during the entire training process, or is it updated with loss backpropagation every epoch? Additionally, the authors noted that prototypes are refined during training, but updating them with k-means clustering may be more reasonable.
    4. Is there a specific reason for resizing the patches to 512x512 instead of directly cropping them to 512x512?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    There are some confusing parts that need further clarification.

    1. Since the model has two branches of predictions, how is the final prediction determined?
    2. The design of this model may lead to a focus on non-cancerous features if the dataset contains numerous slides with extensive non-tumor regions.
    3. Is the pre-trained encoder frozen during the entire training process, or is it updated with loss backpropagation every epoch? Additionally, the authors noted that prototypes are refined during training, but updating them with k-means clustering may be more reasonable.
    4. Is there a specific reason for resizing the patches to 512x512 instead of directly cropping them to 512x512?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments are less convincing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Even though the authors have addressed all the comments, I still find the work less convincing.




Author Feedback

Thank you for your thorough review and valuable feedback on our manuscript. Below, we provide a detailed response to address the concerns raised.

  1. Comments about Interpretability: The reviewers mentioned the lack of an interpretability comparison. Response: Compared to conventional attention-based MIL methods, our PAMIL offers both attention scores and prototype scores as explanations. The interpretive process is analogous to that of a pathologist, who diagnoses by comparing key areas of the slide to typical patterns, as shown in section 4.2.

  2. Comments about Probabilistic Prediction and KL Divergence: The reviewers have doubts about Y_proto and Y_inst, as well as the use of KL divergence as a regularization loss. Response: In our paper, Y_proto and Y_inst represent probability values from the two prediction branches, with the final model prediction being their average. KL divergence measures the difference between probability distributions (not the original logits) and is used to maintain consistency between the predictions of the two branches. Ablation experiments validate this approach. For multi-label classification, KL divergence is applied to each label's probability predictions independently (see the first sketch following this feedback).

  3. Comments about the Necessity and Effectiveness of the Two-step Clustering Process. Response: In our paper, we utilize a two-stage clustering method to initialize prototypes, ensuring alignment with the distribution of slide patches and thereby enhancing training stability. Since prototypes are updated during training, any initial randomness does not significantly impact the overall process. Moreover, this approach is more efficient than a single k-means over all patches, as evidenced by Yu et al. (2023) [17] (see the clustering sketch following this feedback).

  4. Comments about the Optimization of Prototypes and the Meaning of the w_c Matrix: The reviewers ask why prototypes are not updated with k-means clustering, how prototypes relate to categories and actual patches, and how the matrix w_c works. Response: Prototype optimization proceeds in three stages: backpropagation, a projection operation that replaces each prototype with its most similar instance (linking prototypes to actual patches), and fine-tuning of the other parameters (see the projection sketch following this feedback). The w_c matrix (dimensions n_class×n_proto) captures the relationship between prototypes and categories (section 3.2). This category relationship cannot be learned by optimizing prototypes with k-means clustering.

  5. Comments about the Differentiation Between the Instance Branch and Prototype Branch. Response: We designed the instance and prototype branches separately to aggregate features from both prototypes and instances, as shown in section 3.2. This enhances model performance and provides two complementary aspects of explanation. Since the number of prototypes is fixed while the number of instances varies per slide, we design a different aggregation method for each branch.

  6. Comments about Effectiveness in Multi-Class Scenarios: The reviewers pointed out that the method did not achieve SOTA on the multi-class classification task. Response: There are two reasons for not achieving SOTA on the multi-class task. First, current SOTA methods already perform excellently (more than 90%), making further improvements difficult; our t-test revealed no significant difference between PAMIL and the SOTA methods. Second, this paper prioritizes model interpretability and demonstrates significant improvement in the more challenging multi-label scenario.

  7. Comments about the Potential to Focus on Features of Non-cancerous Regions. Response: The attention scores make the model pay more attention to category-related features (cancerous areas). In addition, the visualizations show that the model does not over-attend to normal areas (section 4.2).

Finally, regarding other details, our pre-trained encoders and patch cropping follow established practices. We will rectify any typographical errors and clarify expressions in the final paper. Additionally, we plan to publicly release the code in a future update.
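For item 2 above, here is a minimal sketch of the per-label consistency term, assuming each label is treated as an independent Bernoulli; the symmetric form of the KL and the helper name consistency_kl are assumptions, since the feedback does not specify whether the loss is one-sided or symmetric.

```python
import torch

def consistency_kl(p_proto, p_inst, eps=1e-6):
    """Symmetric Bernoulli KL between the two branches' per-label probabilities.

    p_proto, p_inst: (n_class,) tensors of probabilities in [0, 1].
    For multi-label classification each label is handled independently,
    matching the authors' response; the symmetric form is an assumption.
    """
    p = p_proto.clamp(eps, 1 - eps)
    q = p_inst.clamp(eps, 1 - eps)
    kl_pq = p * (p / q).log() + (1 - p) * ((1 - p) / (1 - q)).log()
    kl_qp = q * (q / p).log() + (1 - q) * ((1 - q) / (1 - p)).log()
    return 0.5 * (kl_pq + kl_qp).mean()

# The final model prediction, per the response, is the branch average:
# y_final = 0.5 * (p_proto + p_inst)
```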
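For item 3, one plausible reading of the two-stage initialization, sketched with scikit-learn: k-means within each slide first, then a second k-means over the pooled per-slide centroids. The per-slide-then-global staging and the helper name init_prototypes are assumptions for illustration; the point is that two small clusterings avoid one expensive k-means over millions of patches.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_prototypes(slide_feats, k_slide=10, k_proto=8, seed=0):
    """Two-stage k-means prototype initialization (illustrative sketch).

    slide_feats: list of (n_patches_i, d) arrays, one array per slide.
    Stage 1 summarizes each slide by k_slide centroids; stage 2 clusters
    the pooled centroids from all slides into k_proto prototypes.
    """
    stage1 = [
        KMeans(n_clusters=k_slide, n_init=10, random_state=seed).fit(f).cluster_centers_
        for f in slide_feats
    ]
    pooled = np.concatenate(stage1, axis=0)  # (n_slides * k_slide, d)
    km = KMeans(n_clusters=k_proto, n_init=10, random_state=seed).fit(pooled)
    return km.cluster_centers_               # (k_proto, d)
```

Under this reading, Reviewer #1's post-rebuttal suggestion (cluster roughly 500 pooled centroids into 8) is what stage 2 does once the per-slide centroids from many slides are concatenated.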
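For item 4, the projection step can be sketched as snapping each prototype to its most similar real patch embedding, which is what ties prototypes back to actual regions; the cosine metric and the helper name project_prototypes are assumptions.

```python
import torch
import torch.nn.functional as F

def project_prototypes(prototypes, instance_feats):
    """Replace each prototype with its nearest actual instance feature.

    prototypes: (n_proto, d); instance_feats: (n_inst, d) patch embeddings
    pooled from the training set. Cosine similarity is assumed here.
    """
    p = F.normalize(prototypes, dim=1)
    x = F.normalize(instance_feats, dim=1)
    nearest = (p @ x.t()).argmax(dim=1)      # best-matching instance per prototype
    return instance_feats[nearest].clone()   # (n_proto, d)

# The class-prototype relation described in the response would then be a
# learnable matrix w_c of shape (n_class, n_proto) mapping prototype-level
# scores to class logits, e.g.:
#   logits = proto_scores @ w_c.t()          # proto_scores: (n_proto,)
```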




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


