Abstract

Hematoxylin and eosin (H&E) staining offers the advantages of low cost and high stability, effectively revealing the morphological structure of the nucleus and tissue. Predicting the expression levels of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) from H&E stained slides is crucial for reducing the detection cost of the immunohistochemistry (IHC) method and tailoring the treatment of breast cancer patients. However, this task faces significant challenges due to the scarcity of large-scale and well-annotated datasets. In this paper, we propose a double-tier attention based multi-label learning network, termed DAMLN, for simultaneous prediction of ER, PR, and HER2 from H&E stained WSIs. Our DAMLN considers slides and their tissue tiles as bags and instances under a multiple instance learning (MIL) setting. First, the instances are encoded via a pretrained CTransPath model and randomly divided into a set of pseudo bags. Pseudo-bag guided learning via cascading the multi-head self-attention (MSA) and linear MSA blocks is then conducted to generate pseudo-bag level representations. Finally, attention-pooling is applied to class tokens of pseudo bags to generate multiple biomarker predictions. Our experiments conducted on large-scale datasets with over 3000 patients demonstrate great improvements over comparative MIL models.
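
The final step of the abstract (attention-pooling over pseudo-bag class tokens, followed by one prediction per biomarker) can be illustrated with a small sketch. This is not the authors' implementation: the scoring vector `w_attn`, the per-biomarker weight vectors in `heads`, the token dimension, and the random inputs are all assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, w_attn):
    """Aggregate class tokens into one slide-level vector.

    Each token is scored with a (hypothetical) linear vector w_attn;
    the softmax of the scores gives the pooling weights.
    """
    scores = [sum(t * w for t, w in zip(tok, w_attn)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def multi_label_predict(pooled, heads):
    """One independent sigmoid output per biomarker (multi-label, not softmax)."""
    return {
        name: 1.0 / (1.0 + math.exp(-sum(p * w for p, w in zip(pooled, wvec))))
        for name, wvec in heads.items()
    }


# Toy inputs: 4 pseudo-bag class tokens of dimension 8 (assumed sizes).
rng = random.Random(0)
dim, n_pseudo_bags = 8, 4
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_pseudo_bags)]
w_attn = [rng.gauss(0, 1) for _ in range(dim)]
heads = {name: [rng.gauss(0, 1) for _ in range(dim)] for name in ("ER", "PR", "HER2")}

pooled, weights = attention_pool(tokens, w_attn)
probs = multi_label_predict(pooled, heads)
```

Using independent sigmoid outputs rather than a single softmax reflects the multi-label setting, in which ER, PR, and HER2 status are not mutually exclusive.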

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3214_paper.pdf

SharedIt Link: https://rdcu.be/dVY87

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72378-0_9

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3214_supp.zip

Link to the Code Repository

https://github.com/PerrySkywalker/DAMLN

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Wan_Doubletier_MICCAI2024,
        author = { Wang, Mingkang and Wang, Tong and Cong, Fengyu and Lu, Cheng and Xu, Hongming},
        title = { { Double-tier Attention based Multi-label Learning Network for Predicting Biomarkers from Whole Slide Images of Breast Cancer } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {91--101}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript introduces a model based on Multiple Instance Learning (MIL), termed DAMLN, designed for predicting the expression levels of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) from H&E-stained slides. DAMLN employs MIL to randomly select bags and instances, and utilizes multi-head self-attention for bag-level representation. It predicts expression levels simultaneously through a multi-task learning approach. The model achieved favorable results on two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main advantages of the manuscript are as follows:

    1. The method in the manuscript employs the Multiple Instance Learning (MIL) approach to directly predict biomarkers from H&E-stained images, achieving favorable results that can save substantial resources and time in clinical settings.
    2. The use of multi-head attention in the manuscript effectively learns global information, enhancing the model’s predictive accuracy.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main limitations of the manuscript are as follows:

    1. The manuscript employs a randomized selection method to allocate features extracted from patches into bags, which introduces a high degree of randomness and weakens interpretability.
    2. The manuscript provides insufficient interpretation and analysis of the experimental results, which may limit understanding of the underlying factors contributing to the performance of the model.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Since the manuscript uses public datasets for testing, I recommend that the authors make their code publicly available to enhance transparency and allow for independent validation of their results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The manuscript should provide a thorough discussion of potential biases introduced by the randomized selection process, highlighting its impact on the overall reliability and reproducibility of the results.
    2. The manuscript claims to use five-fold cross-validation, and it should report the results in terms of mean and variance to provide a more accurate representation of model performance across different data splits.
    3. The figures and tables in the manuscript should be presented in a consistent format to improve readability and facilitate easier comparison of data points across the document.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology and subject matter of the manuscript are innovative; however, there are issues with the experimental design and results that require further explanation. This includes detailing the selection and validation processes used, the statistical methods employed to analyze data, and how these may have influenced the findings. It is crucial that these aspects are clarified to ensure the robustness and credibility of the research.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Although the authors have explained the randomness issue in detail, two experiments alone are not sufficient to resolve it, so the original rating is kept.



Review #2

  • Please describe the contribution of the paper

    This work introduces DAMLN (Double-tier Attention-based Multi-label Learning Network) for predicting breast cancer receptors from digital pathology images. DAMLN has three modules: pseudo-bag generation, pseudo-bag guided learning, and multi-label learning prediction. For the pseudo-bag generation, CTransPath is employed to derive features from tiled patches, which in turn are partitioned into roughly equal pseudo-bags. In the pseudo-bag guided learning stage, standard multi-head self-attention (MSA) and linear MSA blocks are stacked, while the output class tokens of pseudo-bags are fed into an attention-pooling block that aggregates them into the WSI-level representation. The proposed network is trained on a proprietary dataset and further validated externally with a public set, showcasing improved results in most comparisons with other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-studied and presented.
    2. Cancer subtype characterization (which can be simulated by the receptors prediction) is a relevant clinical need.
    3. Strong comparison and evaluation sections, including several methods, two datasets, 5-fold cross-validation, internal and external validation, AUC and accuracy as metrics, and an ablation study to optimize for the number of pseudo-bags.
    4. The results are in favor of DAMLN, especially AUC exhibits considerable improvement over the compared methods in most of the cases in both internal and external validation (when available).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is not clear how/why C=D=2 for the MSA and L-MSA blocks respectively. Was it purely empirical and due to computational power?
    2. A more complete approach, from the clinical significance perspective, would include simultaneous prediction of the proliferation marker Ki-67, too.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    There is no comment on the proprietary dataset or if the suggested network will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Dear authors, thank you for your work. A few comments, in addition to the above:

    1. Please clarify why Table 4 is heavily incomplete in its upper half. Are not these models tested on the external validation set?
    2. Minor language/typo issues: when describing quantities in text, the convention is usually to spell out numbers up to nine (referring to the two MSA/L-MSA blocks); page 8, par. 1: “psuedo-bag”.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-studied and presented work. Please clarify the missing values in Table 4, and comment on reproducibility.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Keeping it as weak accept. It is a well-written paper and the rebuttal addressed my comments. A small hesitation still lies in how incomplete the comparisons in Table 4 are. For almost half of the selected methods, no safe conclusions can be drawn in both metrics (mostly interested in AUC) and for all receptors, which is the focus here.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a novel multi-label learning model to predict multiple biomarkers (ER, PR, HER2) from H&E stained whole slide images of breast cancer. The motivation of this method is to reduce the need for costly and time-consuming immunohistochemistry (IHC) tests. This approach offers the potential for more cost-effective and efficient diagnostics in clinical settings, demonstrating superior performance over existing models in large-scale datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The technical development and validation of the DAMLN model are robust, with comprehensive testing over two datasets.
    • Extensive comparison experiments on the internal and external datasets are implemented, including the comparison with SOTA MIL models and recent relevant methods.
    • The results demonstrate the potential for clinical application of the proposed method by offering a cost-effective alternative to IHC staining.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The ablation studies are not sufficient, e.g., they are based only on the internal dataset.
    • The paper lacks statistical comparisons, which are necessary to support the conclusions drawn from the numerical results.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The contribution of the paper is strong and the experimental validation appears extensive.

    • Since 5-fold cross-validation is conducted, it would be better to provide the standard deviation of each metric.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the proposed method and efforts to improve the model performance and efficiency. The contribution of the paper is strong and the overall experimental validation appears sufficient.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors’ response addressed my concerns. I will keep my rating.




Author Feedback

We thank the reviewers for their valuable comments. In the following, we respond to their comments, grouped into a few major categories:

[Q1] The random selection of pseudo-bags impacts reliability and reproducibility, and weakens interpretability. R1: In our experiments, we set the random seed to 42 during both the training and testing phases to ensure consistent results under the same experimental setup. The pseudo-bags are intended for the pseudo-bag guided learning phase, where they provide diversified inputs to the model. During the testing phase, we tried reducing the number of pseudo-bags to one, which had almost no effect on the model’s performance. To assess the model’s robustness against randomness, we also tested its performance with the random seed set to 2024. The results showed that the AUCs for ER, PR, and HER2 were 0.9201, 0.8653, and 0.8956, respectively, similar to the results with the seed set to 42. Overall, our evaluations showed that the random selection of pseudo-bags helps in training our multiple instance learning (MIL) model without adversely affecting its reliability, reproducibility, and interpretability during testing.
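
To make the reproducibility argument in R1 concrete, the following minimal sketch shows how a fixed seed makes a random partition of instances into roughly equal pseudo-bags repeatable. The function name `split_into_pseudo_bags` and the toy sizes are hypothetical, and the round-robin split after shuffling is an assumption, not necessarily the partitioning scheme used in the paper.

```python
import random


def split_into_pseudo_bags(n_instances, n_bags, seed):
    """Shuffle instance indices with a seeded generator and deal them
    round-robin into n_bags pseudo-bags of (roughly) equal size."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    indices = list(range(n_instances))
    rng.shuffle(indices)
    return [indices[i::n_bags] for i in range(n_bags)]


# Same seed -> identical partition; a different seed gives another partition.
bags_a = split_into_pseudo_bags(10, 3, seed=42)
bags_b = split_into_pseudo_bags(10, 3, seed=42)
bags_c = split_into_pseudo_bags(10, 3, seed=2024)
```

Every instance lands in exactly one pseudo-bag, so a slide's full set of tiles is still covered regardless of the seed; only the grouping changes.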

[Q2] Mean and standard deviation of 5-fold cross validation should be provided. R2: We will provide them in the final version.
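
As an illustration of the reporting requested in Q2, the sketch below summarizes per-fold scores as mean and sample standard deviation. The per-fold values are made-up placeholders, not results from the paper, and `summarize_folds` is a hypothetical helper name.

```python
import statistics


def summarize_folds(fold_scores):
    """Return {metric_name: (mean, sample standard deviation)} across folds."""
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in fold_scores.items()
    }


# Placeholder AUCs for five folds (illustrative numbers only).
fold_aucs = {
    "ER": [0.90, 0.92, 0.91, 0.93, 0.89],
    "PR": [0.85, 0.86, 0.84, 0.87, 0.83],
    "HER2": [0.88, 0.89, 0.90, 0.87, 0.91],
}

summary = summarize_folds(fold_aucs)
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

`statistics.stdev` uses the sample (n - 1) estimator; `statistics.pstdev` would give the population version, and a report should state which one is used.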

[Q3] Make the source code publicly available. R3: To maintain the integrity of the peer-review process, we will release our code on GitHub after publication.

[Q4] Clarify missing values in Table 4. R4: The studies mentioned in the upper half of Table 4 were not reproduced by us; instead, we directly cited their results from the papers.

[Q5] Why C=D=2 for the MSA and L-MSA blocks. R5: Due to overfitting issues in training slide-level MIL models, a deeper model with more MSA and L-MSA blocks does not necessarily yield better results. We chose this number based on previous papers (e.g., Shao et al. NIPS 2021) and our preliminary experiments.

[Q6] Include simultaneous prediction of the marker Ki-67. R6: Within our datasets, Ki-67 expression rates manifest as continuous values, typically presented as percentages. The process of categorizing Ki-67 expression into positive or negative relies on applying a predetermined threshold. However, determining what constitutes positive or negative expression for Ki-67 warrants thorough deliberation. Some papers consider Ki-67 expression greater than 10% as indicative of high expression, whereas in the molecular typing of breast cancer, the threshold is set at 14%. Therefore, there is a need to explore multiple thresholds, such as 10% and 15%, for categorizing patients as Ki-67 positive or negative. This is a key focus of our ongoing multi-label classification study.
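
The thresholding issue described in R6 can be sketched as follows: the same continuous Ki-67 percentages yield different binary labels depending on the cutoff. The sample percentages are hypothetical, `binarize_ki67` is an illustrative helper, and the strict "greater than" comparison follows the wording above; the 10% and 15% cutoffs are the ones named in the response.

```python
def binarize_ki67(percentages, threshold):
    """Label a case positive (1) when its Ki-67 percentage exceeds the threshold."""
    return [1 if p > threshold else 0 for p in percentages]


# Hypothetical Ki-67 expression rates, in percent.
ki67 = [5.0, 12.0, 14.5, 30.0]

# One label vector per candidate threshold.
labels = {t: binarize_ki67(ki67, t) for t in (10.0, 15.0)}
```

Here the cases at 12.0% and 14.5% are positive under the 10% cutoff but negative under the 15% cutoff, which is why the choice of threshold changes the classification target itself.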

[Q7] Insufficient analysis of underlying factors contributing to the performance of the model. R7: The superior performance of our model can be attributed to several factors. First, our pseudo-bag guided learning enhances the diversity and quantity of bags, which effectively trains the MIL framework and thereby improves prediction performance. Second, by stacking standard MSA and linear MSA blocks, our model can better learn global interactions among instances, resulting in improved instance aggregation. Third, our multi-label learning model exploits the correlation among biomarkers to enhance accuracy and efficiency in prediction.

[Q8] Ablation studies and statistical comparisons are not enough. R8: In our experiments, we conducted ablation studies on both internal and external datasets, yielding similar results. However, we report only the ablation studies on the internal dataset, as we determined the hyperparameters of our model based on these studies and aimed to adhere to the paper’s length limitations. For statistical comparisons, we computed both the mean and standard deviation from 5-fold cross-validation. These results will be incorporated into the final version.

[Q9] Minor language and inconsistency issues. R9: We will address them in the final version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    NA

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NA


