Abstract

Slide-level classification for whole-slide images (WSIs) has been widely recognized as a crucial problem in digital and computational pathology. Current approaches commonly treat a WSI as a bag of cropped patches and process them via multiple instance learning; due to the large number of patches, they cannot fully explore the relationships among patches, i.e., the global information cannot be fully incorporated into decision-making. Herein, we propose an efficient and effective slide-level classification model, named FALFormer, that processes a WSI as a whole so as to fully exploit the relationships among all patches and to improve classification performance. FALFormer is built upon Transformers and the self-attention mechanism. To lessen the computational burden of the original self-attention mechanism and to process all patches in a WSI together, FALFormer employs Nyström self-attention, which approximates the computation using a smaller number of tokens, or landmarks. For effective learning, FALFormer introduces feature-aware landmarks to enhance the representational power of the landmarks and the quality of the approximation. We systematically evaluate the performance of FALFormer on two public datasets, CAMELYON16 and TCGA-BRCA. The experimental results demonstrate that FALFormer achieves superior performance on both datasets, outperforming state-of-the-art methods for slide-level classification. This suggests that FALFormer can facilitate an accurate and precise analysis of WSIs, potentially leading to improved diagnosis and prognosis from WSIs.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2447_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Bui_FALFormer_MICCAI2024,
        author = { Bui, Doanh C. and Vuong, Trinh Thi Le and Kwak, Jin Tae},
        title = { { FALFormer: Feature-aware Landmarks self-attention for Whole-slide Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present FALFormer, a novel transformer-based model designed for efficient slide-level classification of Whole Slide Images (WSIs). Leveraging Nyström self-attention, FALFormer aims to reduce the computational demands traditionally associated with self-attention mechanisms by approximating computations with a reduced set of tokens or landmarks. A key innovation of FALFormer is the introduction of feature-aware landmarks, which seeks to improve the representational capacity of landmarks and the overall quality of approximation in processing the entirety of patches within a WSI.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Enhancement with Feature-aware Landmarks: The introduction of feature-aware landmarks represents a novel contribution, potentially improving the model’s ability to capture relevant features within WSIs and enhancing the quality of slide-level classification. The results are good with respect to the literature.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited Benchmarking: The evaluation of FALFormer appears restricted to a small set of competitors, missing comparisons with notable models such as [1,2,3], which could provide a more comprehensive understanding of FALFormer’s performance relative to the state of the art. Moreover, an expanded discussion of emerging efficient alternatives [4] is needed.

    Nyström attention already applied to WSIs: This work [5] already uses Nyström attention for WSIs.

    Lack of Transparency in Methodology: The paper does not sufficiently detail the process for selecting hyperparameters or the data split between training and validation, which are crucial for assessing the reproducibility and robustness of the findings.

    Insufficient Statistical Analysis: There is no detailed statistical analysis of the performance measures, such as whether they represent averages over multiple runs, the number of runs, and the error margins. Given the inherent variability in WSI datasets, such information is vital for evaluating the reliability and generalizability of the results.

    [1] DAS-MIL: Distilling Across Scales for MIL Classification of Histological WSIs. In: Greenspan, H., et al. MICCAI 2023
    [2] HIGT: Hierarchical Interaction Graph-Transformer for Whole Slide Image Analysis. In: MICCAI 2023
    [3] Multi-scale prototypical transformer for whole slide image classification. In: MICCAI 2023
    [4] Mamba: Linear-Time Sequence Modeling with Selective State Spaces
    [5] Gene-induced multimodal pre-training for image-omic classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    To strengthen the paper, the authors are encouraged to:

    1. Broaden Comparative Analysis: Expand the range of competitors included in the evaluation to cover more recent and relevant models, providing a clearer picture of FALFormer’s positioning within current MIL research.
    2. Clarify Implementation Details: Offer a more detailed explanation of the methodology, particularly concerning hyperparameter selection and the strategy for splitting data into training and validation sets.
    3. Provide Comprehensive Statistical Details: Enhance the reporting of statistical analyses, including the number of runs, average performance measures, and associated errors, to bolster the credibility and reproducibility of the findings.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    For the reasons mentioned above, at this stage, I am not confident to suggest this paper for acceptance but I strongly encourage authors to better clarify my doubts.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have successfully addressed all my concerns in the rebuttal.



Review #2

  • Please describe the contribution of the paper

    The authors propose FALFormer, a new method for MIL classification of WSIs. FALFormer employs self-attention to aggregate the patch embeddings, together with the Nyström approximation to make the computation more efficient and effective. Thanks to this approximation, the method can exploit the relationships between all patches rather than selecting only a subset of them. The method was tested on two public datasets, showing a consistent improvement over previous methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Motivation: The authors propose FALFormer and FALSA for MIL classification of WSIs. Since WSIs are high-resolution images, which implies a large number of patches (or tokens), this approximation can overcome the common challenges in current MIL methods.
    • Performance: The evaluation is performed with two public and well-known datasets in digital pathology. They compare with state-of-the-art MIL methods and conduct an ablation study. The results are consistent and justify the efficacy of the method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The novelty of FALSA is not clear: There are aspects of the work that require further explanation. I understand that the FALSA method is built upon clusters created by a K-means algorithm, so the segments are selected based on the cluster index. What is the novelty compared to previous works? The paper should place more emphasis on explaining the proposal thoroughly. For example, in [13], the authors already claimed: “Landmark points (inducing points (Lee et al. 2019)) can be selected by using K-means clustering (Zhang, Tsang, and Kwok 2008; Vyas, Katharopoulos, and Fleuret 2020).” Is the proposal substantially different from this?
    • Mathematical notation may be improved: I believe the notation can be improved. Some indices are omitted, which makes reading a little difficult. For example, in \hat{Q}, the index over which the sum is performed does not appear, so it seems that the sum runs over the index i, which is not the case. I recommend the authors make the mathematical notation more explicit.
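    For reference, the segment-mean landmark computation the reviewer alludes to can be written with explicit indices, following the standard Nyströmformer definition (the symbols below are illustrative and may differ from the paper's exact notation):

    ```latex
    % Segment-means: the j-th landmark query is the mean of the j-th
    % fixed-order segment of m = n / N_s consecutive query vectors.
    \tilde{q}_j = \frac{1}{m} \sum_{i=(j-1)m + 1}^{jm} q_i,
    \qquad j = 1, \dots, N_s,
    ```

    and analogously for the landmark keys \tilde{k}_j; the sum runs over the within-segment index i, not over all tokens.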
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. I believe that the work is interesting and can be applied successfully to several histopathology datasets and tasks. My main concern is about the novelty with respect to previous methods in the literature. I would appreciate it if the authors could clarify the principal difference between FALSA and Nyström and previous methods.
    2. Since the cluster index is used to make the segments, does it result in a different number of segments per WSI? Can this affect the classifier?
    3. Is it possible to use TransMIL with all the patches? Since the features are extracted, can they fit in memory? Also, an interesting comparison would be to compare TransMIL with FALFormer with the same number of patches. Does the Nyström approximation improve the performance when all the patches are considered for both methods?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the performance and interest in the method, I recommend accepting it after rebuttal. However, I would appreciate learning more about the novelty of the methods with respect to previous methods in the literature before making a final decision.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have clarified all my concerns in the rebuttal. The detailed explanation about FALSA against the SOTA methods helped me better understand the novelty. I also consider that the comparison is fair in different popular benchmarks and against popular SOTA MIL methods. Therefore, I believe this paper should be accepted.



Review #3

  • Please describe the contribution of the paper

    This paper highlights the issue of oversimplification in current Multiple Instance Learning (MIL) pooling algorithms used for analyzing Whole Slide Images (WSIs). This oversimplification, which originally aims to reduce the quadratic computational cost of attention, limits the exploration of all patch relationships and the incorporation of global information into decision-making. To address this problem, the paper presents FALFormer, an adaptation of the Nyströmformer algorithm, which approximates self-attention. FALFormer uses K-means for landmark selection and patch clustering. These landmarks are then used in Nyströmformer to approximate the patches’ attention matrix. This method visits every patch for landmark creation, thereby capturing global information in the WSI.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors of this paper improved the Nyströmformer algorithm by incorporating the idea of using the centroids of segments as landmarks and adapting the method for Whole Slide Images (WSI), where patches may represent different tissue subpopulations. This adaptation resulted in significant progress in the task of breast cancer subtyping using the TCGA-BRCA dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The algorithm’s performance is sensitive to the patch-encoder embedding space. For example, an embedding space containing multiple minor or anomalous classes alongside one major class can significantly reduce performance. This might explain the performance gap between the ResNet50 and CTransPath encoders in this method. Further exploration through ablation studies on the effects of parameters such as the number of clusters (N_s) and the mean and variance of the number of patches in each segment (C_i) could provide valuable insights into performance optimization.

    2. Additionally, the model is computationally intensive, as indicated in Fig. 2b. The authors suggest that this may be due to the utilization of the entire patch set.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. While FALFormer is not the first method to utilize Nyströmformer for WSI analysis (as seen in TransMIL), acknowledging this in the introduction not only adds accountability to the paper but also underscores the significance of the proposed feature-aware landmarks in enhancing transformer performance for WSI analysis.

    2. Nyströmformer employed various algorithms for landmark selection, such as K-means and Segment-means. However, the Nyströmformer authors demonstrated that Segment-means is superior in their specific contexts and tasks. A discussion of why K-means might be more suitable for WSI classification could enhance the clarity of the idea.

    3. The process of patch sampling (as depicted in Figure 1) can be somewhat confusing, particularly considering the emphasis on utilizing entire patch embeddings. Providing information on the percentage of patches remaining after patch sampling and clarifying the types of patches removed would be beneficial.

    4. The differentiation between the FALSA architecture and Nyströmformer is not clearly delineated in Figure 2 and the FALSA section. It would be helpful to highlight these distinctions clearly for better understanding.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper brings attention to a crucial problem, the limitations of currently used methods in adequately capturing global information from Whole Slide Images (WSI). Furthermore, it proposes a solution to overcome this challenge.

    In my view, the novelty of this paper lies in raising the question of the importance of the landmarks selection in the performance of Nyströmformer. It highlights that the image-based context of WSI differs from the text-based contexts that Nyströmformer originally was designed and tailored for.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors did not address my main weaknesses, so I assume they exist. Even considering them, the work is worthy, and the main reason I initially accepted the work has not changed.




Author Feedback

We sincerely thank all reviewers for their kind reviews and constructive comments.

● The reviewers had a concern about the novelty of our work (FALFormer and FALSA) and asked to clarify the difference between our work and other related works:

The main difference between FALSA and Nyström Attention (NA) lies in how the landmarks are found. NA groups the tokens of the sequence into segments in a fixed order (left-to-right) and computes the mean of each segment, which ignores the spatial relationships among patches. Instead, we propose to obtain landmarks by K-means clustering, which groups similar patches (in feature space) together. In this manner, FALSA becomes aware of the spatial relationships among patches/tissues, better approximates self-attention, and improves classification performance.
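To make the contrast concrete, the following is a minimal NumPy sketch of the two landmark-selection strategies and of the Nyström approximation they feed into. The function and variable names are ours, not the paper's, and the scaling by the key dimension is omitted for brevity; this is an illustration of the general technique, not the authors' implementation.

```python
import numpy as np

def segment_mean_landmarks(x, m):
    """NA-style landmarks: mean of fixed-order (left-to-right) segments."""
    n, d = x.shape
    seg = n // m
    return x[: seg * m].reshape(m, seg, d).mean(axis=1)

def kmeans_landmarks(x, m, iters=10, seed=0):
    """FALSA-style landmarks: centroids of K-means clusters in feature space."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=m, replace=False)].copy()
    for _ in range(iters):
        # Assign each patch embedding to its nearest centroid.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(m):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def _softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def nystrom_attention(q, k, v, q_lm, k_lm):
    """Approximate softmax(q k^T) v using m landmark queries/keys."""
    f = _softmax(q @ k_lm.T)                     # (n, m)
    a = np.linalg.pinv(_softmax(q_lm @ k_lm.T))  # (m, m) pseudo-inverse
    b = _softmax(q_lm @ k.T)                     # (m, n)
    return f @ (a @ (b @ v))                     # (n, d)

# Tiny demo: 512 "patch embeddings" of dimension 32, 16 landmarks.
rng = np.random.default_rng(1)
x = rng.standard_normal((512, 32))
lm = kmeans_landmarks(x, 16)
out = nystrom_attention(x, x, x, lm, lm)
```

Either landmark function can be dropped into `nystrom_attention`; the cost falls from O(n^2) to O(n m) for n tokens and m landmarks, which is what allows all patches of a WSI to be processed at once.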

In Nyströmformer (Y. Xiong, et al.), Segment-means was preferable to K-means, which contradicts our findings with FALFormer. We believe this is mainly due to an intrinsic difference in the data: Nyströmformer was originally developed and tested on NLP data, whereas FALFormer is specifically designed for pathology images, where the spatial relationships among patches/tissues are known to play a critical role.

There exist approaches similar to FALSA, which differ from it as follows. In (G. Ziyu, et al.), patches were grouped and pooled and then processed by a variant of self-attention. In (D. Saisal, et al.), self-attention was adopted with a reduced number of patches. In (T. Jin, et al.), NA was employed for patch aggregation. FALSA, in contrast, utilizes all patches with spatially-aware landmarks. (G. Bontempo, et al.) used a multi-scale strategy, which differs from the above-mentioned studies. Moreover, TransMIL utilized NA to reduce the computational complexity. FALSA can be applied to TransMIL, but the relationship between FALSA and PPEG needs to be explored. Since FALSA only requires a small number of landmarks, it fits in memory.

To highlight the novelty of our work and the difference with others, we will update Fig. 1 and the manuscript. For credibility, we will also reference TransMIL in the introduction.

● The reviewers pointed out that the experiments and evaluation of FALSA need to be extended: Due to MICCAI policy, we cannot provide additional experiments and results on FALSA and other related works. However, we adopted two popular WSI datasets (CAMELYON16 and TCGA-BRCA) and two popular models (CLAM and TransMIL) that have been shown to be applicable to a wide range of datasets/problems. The results reported in our work suggest that FALSA can achieve superior performance on other pathology datasets in comparison to other works.

● The reviewers asked to clarify the patch sampling and clustering procedures:

By “patch sampling” we refer to the process of tiling a WSI into non-overlapping patches. Given a WSI, all tissue regions are considered when obtaining patches, while background inside/outside the tissue regions is ignored. The cropped patches are used for training without further sampling.

We set the maximum number of segments N_s to 256. Since the number of all patches N is much larger than N_s, N_s stays the same across WSIs. In fact, N_s does not affect the architecture of FALFormer and FALSA.

● The reviewers asked to provide a detailed explanation of methodology and experiments:

We will provide a detailed description of our method and experiments and improve the mathematical notation, in particular for NA. For instance, d_model is set to 768, as in Vision Transformers. The number of layers (L) is set to 2, the same as in TransMIL.

For CAMELYON16 and TCGA-BRCA, we employed the commonly used data splits. We will provide the details in the final manuscript and make the splits publicly available for reproduction.

FALFormer and other models were evaluated under identical conditions, using the same random seed and environment. The best model was chosen based on a validation set. The experiment was conducted only once for FALFormer and other models.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    NA

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NA


