Abstract

Accurate cancer diagnosis remains a critical challenge in digital pathology, largely due to the gigapixel size and complex spatial relationships present in whole slide images. Traditional multiple instance learning (MIL) methods often struggle with these intricacies, especially in preserving the necessary context for accurate diagnosis. In response, we introduce a novel framework named Semantics-Aware Attention Guidance (SAG), which includes 1) a technique for converting diagnostically relevant entities into attention signals, and 2) a flexible attention loss that efficiently integrates various semantically significant information, such as tissue anatomy and cancerous regions. Our experiments on two distinct cancer datasets demonstrate consistent improvements in accuracy, precision, and recall with two state-of-the-art baseline models. Qualitative analysis further reveals that the incorporation of heuristic guidance enables the model to focus on regions critical for diagnosis. SAG is not only effective for the models discussed here, but its adaptability extends to any attention-based diagnostic model. This opens up exciting possibilities for further improving the accuracy and efficiency of cancer diagnostics.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1524_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1524_supp.pdf

Link to the Code Repository

https://github.com/kechunl/SAG

Link to the Dataset(s)

https://camelyon16.grand-challenge.org/

BibTex

@InProceedings{Liu_SemanticsAware_MICCAI2024,
        author = { Liu, Kechun and Wu, Wenjun and Elmore, Joann G. and Shapiro, Linda G.},
        title = { { Semantics-Aware Attention Guidance for Diagnosing Whole Slide Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The work explores a mechanism for Semantics-Aware Attention Guidance (SAG), which can be easily added to attention/transformer-based models. Experiments are performed on the CAMELYON16 and Melanoma datasets. ABMIL and ScAtNet are used as baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The Semantics-Aware Attention Guidance (SAG) mechanism is an effective addition to the baseline models as shown by the results. The paper is clearly presented with mathematical formulations explaining or reminding readers of all the needed concepts.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Motivation for using Tissue Guidance (TG) instead of filtering background patches before feature extraction needs to be provided (see comments to the authors for the disadvantages of extracting features from background patches).

    Less-known but very relevant works by Tourniaire et al. have not been referenced. A comparison with the proposed method, in terms of ideas and results on CAMELYON16, is needed. See comments to the authors for details of the two works by Tourniaire et al.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    “We use ABMIL’s [10] and ScAtNet’s [21] public codebase for implementation and train models under their experimental settings.” means that reproducibility will depend on how easy it is to start working with the aforementioned repositories.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please add the dataset names (CAMELYON16 and Melanoma) into the abstract instead of “two distinct cancer datasets”. This will help readers to understand how relevant your work is to them from the abstract alone. Specify if both datasets are publicly available.

    Consider referencing 3 attention-based (not transformer-based) highly-influential works in the introduction: (1) ABMIL by Ilse et al. 2018 (2) DSMIL by Li et al. 2021 (3) CLAM by Lu et al. 2021

    Consider referencing and comparing to (in terms of ideas, not asking to run experiments) two less known, but very relevant works. Tourniaire et al. evaluate their methods on Camelyon16 Dataset for classification and localisation, so the comparison is highly relevant. (1) “Attention-based Multiple Instance Learning with Mixed Supervision on the Camelyon16 Dataset” MICCAI-2021 paper by P. Tourniaire et al. (2) “MS-CLAM: Mixed supervision for the classification and localization of tumors in Whole Slide Images” Journal Article from 2023 by P. Tourniaire et al.

    Introduction: “extremely expensive” - please clarify if you mean computational cost / GPU memory requirements / monetary cost

    Introduction: I am not sure if the BoW analogies are helpful since the explanation of a MIL framework can be performed without reference to BoW. Consider changing, however, this is more of a matter for personal preference. Other people might find the analogy useful.

    Figure 1: please check if the heat maps have been misplaced. I am not sure whether it’s a bug (artefact of processing the visualisation) or a feature (to show that background regions have been attended).

    Page 2, introduction, typo: “ducts, and etc.,” -> “ducts, etc.,”

    Page 2, introduction: “However, such models often mistakenly focus on non-cancerous regions or just empty spaces, as highlighted by the green boxes in Fig. 1.”. This behaviour is easily countered by filtering out the background patches as done in DSMIL by Li et al. 2021 or CLAM by Lu et al. 2021. In my opinion, extracting features from clear-cut background patches is wasteful in terms of computational power, since there is no information that should be considered relevant and learnt by the models. Note, choosing how to best use patches which contain both tissue and background is still an open question.

    Page 2 and 3, introduction: move citations to be straight after the name: “Miao et al. [16]”, “Chen et al. [4]”

    Section 2.2: please use a different letter to denote linear layers to learn the attention weights. $\sigma$ is typically used for sigmoid activation function, so using it for a linear (fully-connected) layer is confusing. Consider f(x) if not used elsewhere.

    Section 2.3: please motivate why background patches were included into the feature extraction process instead of being filtered out. Filtering out the background patches will reduce computational cost both when extracting features and when aggregating predictions. It will also remove the need for Tissue Guidance.

    Section 3.1 Datasets: Please add a citation to the original paper that introduced the Melanoma dataset. Please add links for both datasets to facilitate reproducibility.

    Section 3.3: please clarify “15 runs of experiments with randomly sampled seeds”. Are seeds used for train/val/test splitting, model initialisation, or something else?

    Please comment on the feature extraction weights from DSMIL for CAMELYON16. Testing on images that have been used for contrastive learning training improves performance due to data leakage.

    Please order the cited references in ascending order to make look-up easier, e.g. last line on page 1 [9, 10, 15, 12] -> [9, 10, 12, 15]. I think it’s easier to look up the references when they are sorted.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The SAG mechanism is effective, but it is not clear what part Tissue Guidance (TG) is playing and how different the results on the Melanoma dataset would be if tissue was segmented before feature extraction. Explaining motivation for using TG instead of filtering background patches is essential to motivate the use of SAG.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors mostly addressed my questions. The reasoning about keeping background tissue so that the positional encoding for the ViT can work is satisfactory, but should be presented as the authors’ choice rather than a necessity. In the original Transformer (Attention is all you need), the sentences can have varying lengths, which is dealt with using padding to a specific length and explicit masking of the attention weights of the padding tokens.
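The padding-and-masking alternative described above can be sketched in a few lines of numpy (an illustrative sketch only — the function name `masked_attention` and the tensor shapes are assumptions, not code from the paper or from any Transformer library):

```python
import numpy as np

def masked_attention(q, k, v, pad_mask):
    """Scaled dot-product attention that ignores padding tokens.

    q, k, v: (seq_len, d) arrays; pad_mask: (seq_len,) boolean array,
    True for real tokens, False for padding.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (seq_len, seq_len) similarity scores
    scores[:, ~pad_mask] = -1e9            # padding keys get effectively -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                     # padded tokens contribute ~nothing

# Two real tokens padded to a fixed length of 4:
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
pad_mask = np.array([True, True, False, False])
out = masked_attention(q, k, v, pad_mask)
```

Because the masked keys receive attention weight of effectively zero, variable-length inputs can share one fixed batch shape without the padding influencing the output — the standard mechanism the review contrasts with keeping background patches.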



Review #2

  • Please describe the contribution of the paper

    This paper presents an attention-guiding module for attention-based multiple instance learning or Transformer models. The method is tested on two different datasets (i.e., Melanoma, CAMELYON16) and demonstrates improved performance on both datasets with two different backbones.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The design of this method is based on the observation that attention maps often attend to irrelevant regions. The proposed attention guidance can enforce the model to ‘pay more attention to the more relevant regions.’ Also, their method is generic and can be applied to both MIL frameworks and ViTs. Validation was performed on two public datasets, and two different backbones were used. The paper is well written and easy to follow. Figures are very helpful.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The major weakness is the validation.

    1. The results tables do not include tissue guidance (TG)-only results for either experiment. It would be more informative to see the TG results first. For ABMIL, since the background was removed during training, the TG-only results should be similar to the current results. Also, TG will be used in most pathology situations, as HG is very task-specific.
    2. The authors performed 15 runs for these results, but no standard deviation was provided — especially important considering that many results are very close to each other.
    3. Although the authors show some attention maps, they do not provide a quantitative measurement. Since the authors propose that their method improves attention guidance, an evaluation of whether the attention actually improves would support their study.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The author claimed to release the code upon publication. The datasets they used are publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Provide standard deviation for the 15 run results.
    2. Provide quantitative results for the attention maps with/without their methods.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present a useful, generic method for guiding attention-based models. They validated it on two public datasets, with implementations based on two published models. The results show consistent improvements. However, addressing the points I raised would benefit the paper. Recommendation: accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors’ response addressed my concerns. I recommend accepting the paper, provided they incorporate the content from the rebuttal into the revision.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a widely applicable SAG framework to infuse diagnostic models with relevant knowledge and improve both diagnostic performance and interpretability. SAG includes a semantics-guided attention-guiding module with an attention-guiding loss. In addition, a heuristic attention-generation method is proposed to convert diagnostically relevant entities into heuristic-guidance signals. Experiments on two datasets consistently demonstrate state-of-the-art performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The motivation and methodology justification of each proposed component is clear.

    2. The experiments on two datasets consistently demonstrate the effectiveness of the proposed applicable SAG.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper claims that SAG is applicable to any attention-based multiple instance learning or Transformer model; however, the experimental baselines include only ScAtNet and ABMIL. I believe the paper would be stronger with more baseline comparisons.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. This paper claims that SAG is applicable to any attention-based multiple instance learning or Transformer model; however, the experimental baselines include only ScAtNet and ABMIL. I believe the paper would be stronger with more baseline comparisons.

    2. It would be interesting to have a visualization of how the HGs and TGs change during training, to improve interpretability.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivations and novelties are clearly justified. My main concern is the limited number of baseline models, as discussed in my comments above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for their insightful and generous comments. We are encouraged by the reviewers’ comments that SAG is generic (R1), effective (R3, R4), and clearly motivated (R3).

R1- Show TG-only results and include standard deviation (std). Although not included in the paper, we have performed experiments with only TG, which yielded results marginally better than current baselines on the melanoma dataset. Qualitative attention visualizations suggest the same conclusion. Meanwhile, we observed no irregular std across experiments. We will report TG-only results and std in our revision.

R1&R3 - Show quantitative measurement and qualitative progression of attention weights. While our primary focus is to improve diagnosis performance (as shown in Table 1), qualitative visualizations (e.g. Fig. 4 in the paper and Fig.1,2 in the appendix) support the effectiveness of SAG in guiding attention towards relevant regions. We will consider incorporating similarity analysis to offer a more comprehensive assessment of attention guidance.

R3 - More baselines. We strategically chose the two contrasting baselines: ScAtNet, a generic ViT-based model, and ABMIL, a generic attention-based MIL method. Importantly, these baselines are evaluated on two distinct cancer types. The consistent improvements across these very different datasets and models validate SAG’s generalizability and model-agnostic nature.

R4 - Motivation for using TG instead of filtering background patches. We agree removing background patches is standard and efficient in MIL-based methods. However, in a ViT-based model, this would lead to variable-length inputs and potential loss of spatial information, which is critical for the model’s performance. Positional embeddings can help but cannot fully compensate for the missing context. In contrast, TG is a weak supervision method that accounts for potential noise and is not limited to foreground-background differentiation. It can encompass different anatomical maps, allowing transformer heads to learn representations of various tissues, such as epithelium, stroma, blood, and necrosis in breast biopsy WSIs.

R4 - Comparison with Tourniaire et al. We recognize the valuable context in the referenced work. Despite the similar goals of improving attention learning, SAG is distinct from MS-CLAM in three key aspects: (1) MS-CLAM’s attention network hinders its applicability to ViT models, which excel at leveraging global information through inherent attention mechanisms. In contrast, SAG seamlessly fits both MIL and ViT architectures. (2) The attention-guiding loss in SAG (Eqn. 4, 5) offers a more general form, allowing both binary (tumor vs. normal) and continuous signals. (3) SAG incorporates diverse semantic information beyond those directly related to diagnosis — for example, brain atlases could be incorporated as TG, and cell maps are utilized as HG. Compared to MS-CLAM’s binary representation, this broader approach can lead to a more robust learning process, especially for complex datasets where the binary MIL assumption — all instances from a normal slide are normal — does not hold. In such scenarios, SAG provides a weakly-supervised but more generalized framework that guides the model to focus on diverse signals and handle potentially noisy data.

R4 - Missing references. We thank R4 for bringing the missing references to our attention. We will include them in our revision.

R4 - Clarity. We appreciate the feedback on our manuscript. We will incorporate the suggestions to improve the clarity and organization of our paper. As for dataset details, due to the double-blind review process, we cannot disclose specific dataset information, but we will address this in the final version. Regarding training details, the randomly chosen seeds are used for model initialization, and the SimCLR embedder weights are from DSMIL’s official repository, which was trained on CAMELYON16’s training set, so there are no data leakage concerns.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces an attention guiding module for Transformer models, termed SAG, aiming to improve diagnostic performance and interpretability by infusing relevant knowledge. The method is interesting and novel. The paper is clearly written, and the experimental setup and ablation study are sound. Previous concerns of reviewers have been addressed during the rebuttal. After the rebuttal, the reviewers reached consensus about its acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I am inclined to reject this paper, although two reviewers and the other meta-reviewer suggest acceptance. I have the following issues:

    The idea of removing background information is a standard method for improving model performance in the case of WSIs. The TG idea has been explored in many previous studies with better methodologies than using Otsu thresholding. Additionally, the paper does not mention scenarios where Otsu might fail to obtain accurate binary masks. The effect of TG alone has not been reported for the second dataset, where it is available. E.g.:

    Shen H, Wu J, Shen X, Hu J, Liu J, Zhang Q, Sun Y, Chen K, Li X. An efficient context-aware approach for whole-slide image classification. iScience. 2023 Dec 15;26(12).

    Zheng Y, Li J, Shi J, Xie F, Jiang Z. Kernel attention transformer (KAT) for histopathology whole slide image classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. 2022 Sep 16 (pp. 283-292). Cham: Springer Nature Switzerland.

    Xiong C, Chen H, Sung JJ, King I. Diagnose like a pathologist: Transformer-enabled hierarchical attention-guided multiple instance learning for whole slide image classification. arXiv preprint arXiv:2301.08125. 2023 Jan 19. etc.

    The performance evaluation needs to be revised. Class distributions are not mentioned, which is important when reporting model performance. It is ambiguous which curve the AUC in the results table refers to. Both ROC-AUC and the area under the precision-recall curve (average precision-AP) need to be reported in cases of class imbalance.

    The paper heavily utilizes traditional feature engineering by segmenting individual cell types. It would be beneficial to train a baseline model with these features alone for comparison.
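For reference, the Otsu thresholding step criticized above can be reproduced in a self-contained numpy sketch on a toy slide thumbnail (the toy image and the `otsu_threshold` helper are illustrative assumptions, not the paper's preprocessing code):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold for a uint8 grayscale image by
    maximizing the between-class variance over all candidate cuts."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    hist /= hist.sum()
    bins = np.arange(256)
    w0 = np.cumsum(hist)                 # probability of class "<= t"
    w1 = 1.0 - w0                        # probability of class "> t"
    mu0 = np.cumsum(hist * bins)         # cumulative (unnormalized) mean
    mu_total = mu0[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = w0 * w1 * (mu0 / w0 - (mu_total - mu0) / w1) ** 2
    return int(np.nanargmax(between))

# Toy "thumbnail": bright background (~230) with a darker tissue blob (~120)
img = np.full((64, 64), 230, dtype=np.uint8)
img[16:48, 16:48] = 120
t = otsu_threshold(img)
tissue_mask = img <= t   # tissue is darker than the slide background
```

On real slides this simple global threshold can fail when the background is not uniformly bright (pen marks, coverslip edges, faint or fatty tissue) — precisely the failure mode this meta-review asks the authors to discuss.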

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have done a good rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



