Abstract

Classifying PD-L1 status of non-small cell lung cancer (NSCLC) patients by analyzing HE-stained histopathological whole slide images (WSIs) is crucial, because the immunohistochemical examination used in clinical practice is expensive. A multiple instance learning (MIL) framework is usually applied to such WSI classification problems. However, existing MIL methods do not perform well on PD-L1 status classification, owing to non-trainable instance features and challenging instances with weak visual differences. To address this problem, we propose a novelty detection based discriminative multiple instance feature mining method. It contains a trainable instance feature encoder, which learns effective information from the dataset at hand to reduce the domain difference problem, and a novelty detection based instance feature mining mechanism, which selects typical instances to train the encoder so that more discriminative instance features are mined. We evaluate the proposed method on a private NSCLC PD-L1 dataset and on the widely used public Camelyon16 dataset, which targets breast cancer identification. Experimental results show that the proposed method is not only effective in predicting NSCLC PD-L1 status but also generalizes well to the public dataset.
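The overall design can be pictured as a standard embedding-based MIL pipeline with a small trainable adapter on top of frozen pre-trained patch features. The following is a minimal sketch in PyTorch; the names (AdapterEncoder, AttentionMIL) and dimensions are illustrative assumptions, not the authors' actual implementation.

    import torch
    import torch.nn as nn

    class AdapterEncoder(nn.Module):
        """Trainable layers stacked on frozen pre-trained patch features."""
        def __init__(self, in_dim=768, out_dim=256):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                      nn.Linear(512, out_dim))

        def forward(self, feats):            # feats: (n_patches, in_dim)
            return self.proj(feats)          # (n_patches, out_dim)

    class AttentionMIL(nn.Module):
        """Attention pooling over instance features for a slide-level label."""
        def __init__(self, dim=256, n_classes=2):
            super().__init__()
            self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(),
                                      nn.Linear(128, 1))
            self.head = nn.Linear(dim, n_classes)

        def forward(self, inst_feats):       # (n_patches, dim)
            a = torch.softmax(self.attn(inst_feats), dim=0)  # attention weights
            bag = (a * inst_feats).sum(dim=0)                # bag embedding
            return self.head(bag)            # slide-level logits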

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2571_paper.pdf

SharedIt Link: https://rdcu.be/dY6iL

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72083-3_32

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2571_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xu_Novelty_MICCAI2024,
        author = { Xu, Rui and Yu, Dan and Yang, Xuan and Ye, Xinchen and Wang, Zhihui and Wang, Yi and Wang, Hongkai and Li, Haojie and Huang, Dingpin and Xu, Fangyi and Gan, Yi and Tu, Yuan and Hu, Hongjie},
        title = { { Novelty Detection Based Discriminative Multiple Instance Feature Mining to Classify NSCLC PD-L1 Status on HE-Stained Histopathological Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {340--350}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper describes an approach for retraining the patch feature generator of a binary MIL system. A one-class classifier trained on the negative class identifies negative-looking patches in positive images (and thus, by implication, flags the outlier patches as positive examples). Retraining is based on contrastive learning using these identified positive and negative patches. The patch model is built by transfer learning from an existing SOTA feature generator with added trainable layers, and training is done at the patch level (no end-to-end learning).
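    A hedged sketch of this instance-selection step, using scikit-learn's OneClassSVM as one possible novelty detector (the paper's exact detector may differ): fit a one-class model on patch features from negative slides, then treat the most outlying patches of a positive slide as putative positives and the most inlier-like ones as typical negatives for contrastive retraining.

        import numpy as np
        from sklearn.svm import OneClassSVM

        def select_typical_instances(neg_feats, pos_bag_feats, frac=0.1):
            # neg_feats: (N, d) patch features pooled from negative slides
            # pos_bag_feats: (M, d) patch features from one positive slide
            ocsvm = OneClassSVM(kernel="rbf", nu=frac).fit(neg_feats)
            scores = ocsvm.decision_function(pos_bag_feats)  # low = more novel
            order = np.argsort(scores)
            k = max(1, int(frac * len(scores)))
            putative_pos = pos_bag_feats[order[:k]]   # most outlying patches
            typical_neg = pos_bag_feats[order[-k:]]   # most inlier-like patches
            return putative_pos, typical_neg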

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The outlier/positive patch detection and retraining approach seems novel and interesting. The experiments show an empirical improvement on two datasets within a fairly standard (not SOTA) MIL framework. In particular, the improvement is large on the private dataset (much less so on Camelyon16, where performance is similar to SOTA).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The abstract and introduction read like no one has ever tackled this problem (H&E -> PD-L1 prediction) before, or at least, if they have, they failed. A quick Google search brought up several papers in this area. In particular:

    Deep learning-based image analysis predicts PD-L1 status from H&E-stained histopathology images in breast cancer. Gil Shamai et al., Nature Communications, 2022.

    This appears to use a MIL framework and achieves AUC > 0.9 (much better than the results presented in the submission, admittedly on a different, but much larger, dataset). It took me less than a minute to find this paper (which isn't referenced), and there are several others out there. This makes me doubt the depth of the authors' literature review. It also raises questions about why the baseline methods presented in this paper do so badly (AUC ~0.5 in some cases), when other authors have made such methods work.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper claims to be about PD-L1 prediction from H&E images, but only one of the two datasets used relates to that problem. The other is a cancer detection set (Camelyon16). The PD-L1 dataset isn't particularly large, and since it is not public, it is impossible to verify whether this is just a hard dataset or the authors have taken some sub-optimal decisions. Even the presented method only achieves 0.71 AUC on this set, which is clearly far from perfect.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is written as a method for PD-L1 prediction from H&E. Given that the main contribution is a methodological one (positive instance identification and feature generator retraining), this was an unusual choice to make. Given you did choose to write it this way, the presented background on this problem is poor (see previous comments). Additionally, the first sentence is very much overselling things: "It is CRUCIAL to analyze HE-stained histopathological whole slide images (WSIs) to classify PD-L1 status for non-small cell lung cancer (NSCLC) patients, due to the expensive immunohistochemical examination performed in practical clinics." Clearly it would be DESIRABLE to replace IHC with H&E in terms of cost and turnaround time (IF it gave the same accuracy). However, IHC is not that expensive, especially compared to the treatment. The use of the word "crucial" is unfounded; getting this wrong would carry a much bigger cost.

    If this paper were written from a methodological standpoint it would perhaps have been stronger (possibly mainly because I might not have spotted that your baselines are very poor compared to others in the literature). I actually like the idea of identifying positive instances using a one-class negative classifier and using contrastive learning to retrain. It's a good, and to my knowledge novel, idea.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While I liked the main methodological contribution, the success of others in the literature using fairly standard approaches to the PDL1 problem gives me serious doubts about either the dataset, or the baseline implementations used. Perhaps it is just that the dataset is too small, or has too much variation? But it could be some issue with the way baselines have been implemented. I cannot tell.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors state three contributions in the paper; however, based on the final tasks presented, only contribution 2, "We design a novelty detection based instance feature mining mechanism that can select typical instances for mining more discriminative instance features", and contribution 3, "We demonstrate that our method is not only effective in predicting NSCLC PD-L1 status and generalized well for breast cancer identification", were partially completed, given the type of data used.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Excellent overview of the state of the art in MIL models
    • Interesting methodological approach based on the combination of a compact negative instance feature loss and contrastive learning based instance feature mining. In general, the technique measures the "redundancy" of the instances within a ball in feature space and defines a threshold (see the sketch below).
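    One way to read this "ball plus threshold" mechanism is a Deep-SVDD-style compactness objective: negative-bag features are pulled toward a shared center, and instances falling outside a radius threshold are treated as novel. The sketch below is an assumption about the mechanics, not the paper's exact formulation.

        import torch

        def compactness_loss(neg_feats, center):
            # pull instance features from negative bags toward a shared center
            return ((neg_feats - center) ** 2).sum(dim=1).mean()

        def novelty_mask(feats, center, radius):
            # instances falling outside the ball of the given radius are novel
            dist = ((feats - center) ** 2).sum(dim=1).sqrt()
            return dist > radius

        # the radius/threshold could be set from negative-instance distances,
        # e.g. radius = torch.quantile(neg_dists, 0.95)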
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The major weakness is associated with the alleged contribution 1, which is basically replacing one method (IHC) with another (the proposed one). This is a localization problem of PD-L1 within the WSI slide; however, the authors do not address localization within the WSI. They present a visual comparison, but no metric (e.g., a segmentation metric) is given for comparison; only the patch classification problem is tackled. In this way, contribution 1 is not achieved.
    • The two datasets have highly different visual features (staining, contrast, and others), as the authors mention: "On the other hand, PD-L1 status information latently exhibits in HE-stained images, and the negative and positive instances have weak visual differences, unlike the case of cancer identification, showing visually distinct instances". In the end, the paper presents a patch classification problem on two visually different databases and does not address the localization task.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The main problem is associated with PD-L1: to support a contribution like contribution 1, the work should also attack the segmentation problem, relating the IHC ground truth to the results obtained in patch classification. This would bring to the table how to evaluate and identify PD-L1 regions on a WSI slide, which reflects a clinical issue in the way the pathology is evaluated.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is interesting in terms of using "redundancy" to separate the instances of each case; this would have been better demonstrated with a diversity of datasets that address not only the classification problem but also the form of assessment and diagnosis in the clinical setting. In general terms, it is important not to overclaim the contributions, since doing so limits the scope of the article and the final task proposed. Even so, the improved metrics are valuable work and could make inroads into the form of clinical evaluation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novelty detection based method leveraging discriminative multiple instance feature mining. It includes a trainable instance feature encoder to reduce domain differences and employs a novelty detection based mechanism for feature mining. The proposed loss function combines a classification loss, a negative instance feature mining loss for negative bags, and a contrastive learning based instance feature mining loss for typical negative and positive instances within positive bags. Applied to HE-stained histopathological images for NSCLC PD-L1 status classification, the method offers promising results and extends effectively to breast cancer identification.
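    The three-term objective enumerated above could be composed as in the sketch below; the weights, the compactness form of the negative mining term, and the simple repulsion form of the contrastive term are all illustrative assumptions, not the paper's exact formulation.

        import torch
        import torch.nn.functional as F

        def total_loss(bag_logits, bag_label, neg_feats, center,
                       pos_inst, neg_inst, tau=0.1, w1=1.0, w2=1.0):
            # slide-level classification loss
            l_cls = F.cross_entropy(bag_logits.unsqueeze(0),
                                    bag_label.unsqueeze(0))
            # negative instance feature mining: compactness on negative bags
            l_neg = ((neg_feats - center) ** 2).sum(dim=1).mean()
            # contrastive-style repulsion between typical positive and
            # negative instances selected from positive bags
            pos = F.normalize(pos_inst, dim=1)
            neg = F.normalize(neg_inst, dim=1)
            sim = pos @ neg.t() / tau                   # (P, N) similarities
            l_con = torch.logsumexp(sim, dim=1).mean()  # penalize closeness
            return l_cls + w1 * l_neg + w2 * l_con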

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes a novel method that addresses existing challenges in Multiple Instance Learning (MIL). It offers detailed insights into the methodology, including a trainable instance feature encoder and a novelty detection-based instance feature mining mechanism. Furthermore, the paper provides a comprehensive performance analysis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper lacks clarity on its model architecture choices, particularly regarding the combination of one pre-trained model and one trainable model. It doesn’t clearly explain whether both are necessary or if they could be replaced with only a trainable model. Also, the process of selecting pre-trained models is unclear. While the paper uses different pre-trained models for various datasets, it doesn’t specify the criteria for their selection. This raises concerns about whether a new pre-trained model must be designed for each dataset, potentially complicating implementation and generalization of the method.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see main weaknesses part.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Authors propose a novel method and support their claim with comprehensive performance analysis based on various metrics compared to other available approaches.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Authors have addressed my concerns.




Author Feedback

To Rev. #1: We treat PD-L1 prediction on HE-stained WSIs as a slide-level classification problem, which predicts a label for a whole slide, not for small image patches. In clinical screening, a pathologist predicts PD-L1 on an IHC-stained slide by identifying PD-L1 positive subregions and then computing the ratio of positive subregions to the total. Implementing a pathologist-like prediction requires annotating positive subregions on HE slides to train a patch-level classifier. However, this is hard due to the difficulty of visually detecting PD-L1 positive subregions on HE slides and the huge spatial resolution. This led us to give up on such a method and apply a MIL based framework to directly predict slide-level labels. Although MIL methods require splitting a WSI into small patches (instances), no patch-level annotation is necessary. However, without patch-level labels, training on slide-level labels alone becomes difficult. Thus, we design a new method that exploits novelty detection to select typical patches to guide the training.
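For background, the patch-splitting step referred to here is typically implemented by tiling the WSI on a regular grid, for example with the openslide library; the tile size below is an illustrative choice, not the paper's setting.

    import openslide

    def tile_wsi(path, tile=256):
        # read level-0 tiles on a regular grid; each tile is one MIL instance
        slide = openslide.OpenSlide(path)
        w, h = slide.level_dimensions[0]
        for y in range(0, h - tile + 1, tile):
            for x in range(0, w - tile + 1, tile):
                region = slide.read_region((x, y), 0, (tile, tile)).convert("RGB")
                yield (x, y), region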

To Rev. #4: We use different pretrained models for the two datasets for the following reasons. Since PD-L1 prediction is more difficult than cancer identification, the former task requires a more powerful pretrained model. Thus, for PD-L1 prediction we select the CTransPath model, which is trained by a carefully designed self-supervised method on a large pathology dataset. For the Camelyon16 dataset (cancer identification), we follow the other existing MIL methods in using ResNet50 as the pretrained model for a fair comparison. Note that our method improves performance on both datasets no matter which pretrained models are used. Due to the limited space and schedule, we did not include these results in the paper.

To Rev. #5: 1) Our paper is written from the perspective of methodological innovation, NOT to claim that we are the first to perform PD-L1 prediction on HE-stained pathological images. Since the medical purpose could be relatively new for some readers with an engineering background, we briefly introduce the medical background. Because of our own engineering backgrounds and because we are not native English speakers, some descriptions in the introduction may not be suitable or precise. We are sorry about this and will revise them. However, we hope this paper can be judged on its methodological innovation. 2) We are dedicated to inventing a new MIL method to predict PD-L1 of non-small cell lung cancer (NSCLC) on WSIs. Previous works cited in our paper were selected not only by the factor of PD-L1 prediction on HE images, but also by other factors, such as targeted diseases, HE image types, and methods. The paper [Gil Shamai et al., Nat Com 22] is outside our citing range for the following reasons. Their targeted disease is breast cancer, not NSCLC, which exhibits more diverse tissue morphology. They use HE tissue microarray (TMA) images (10^3 pixels), not WSIs, which have a much larger field of view and are extremely large in spatial resolution (10^5 pixels). Due to the less diverse disease and much smaller images, they can directly train a ResNet well. Since WSIs are preferred for clinical diagnosis, we use WSIs. However, our task is much harder, due to the huge data and the diversity of NSCLC. We tried a method similar to theirs, but got an AUC below 0.5. Since this paper did not inspire us much, we neither cited it nor reported the result. 3) We found a SOTA work (https://www.nature.com/articles/s41467-024-46764-0), published in April 2024 and highly related to ours. They invent a different MIL method for pan-cancer PD-L1 prediction on HE WSIs. They use a larger dataset, test two subtypes of NSCLC (LUAD and LUSC), and report results of two models trained on fresh-frozen and FFPE WSIs. Their FFPE model obtains an AUC of 0.71 for LUAD and 0.61 for LUSC (Table S5). We use FFPE WSIs, mix LUAD and LUSC, and report an AUC of 0.724, indicating that our method works properly.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    NA

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NA



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


